Problem Statement

Context

AllLife Bank is a US bank that has a growing customer base. The majority of these customers are liability customers (depositors) with varying sizes of deposits. The number of customers who are also borrowers (asset customers) is quite small, and the bank is interested in expanding this base rapidly to bring in more loan business and in the process, earn more through the interest on loans. In particular, the management wants to explore ways of converting its liability customers to personal loan customers (while retaining them as depositors).

A campaign that the bank ran last year for liability customers showed a healthy conversion rate of over 9%. This has encouraged the retail marketing department to devise campaigns with better-targeted marketing to increase the success ratio.

As a data scientist at AllLife Bank, you have to build a model that will help the marketing department identify potential customers who have a higher probability of purchasing the loan.

Objective

To predict whether a liability customer will buy a personal loan, to understand which customer attributes are most significant in driving purchases, and to identify which segments of customers to target.

Data Dictionary

  • ID: Customer ID
  • Age: Customer’s age in completed years
  • Experience: Number of years of professional experience
  • Income: Annual income of the customer (in thousand dollars)
  • ZIP Code: Home address ZIP code
  • Family: Family size of the customer
  • CCAvg: Average spending on credit cards per month (in thousand dollars)
  • Education: Education level. 1: Undergrad; 2: Graduate; 3: Advanced/Professional
  • Mortgage: Value of house mortgage, if any (in thousand dollars)
  • Personal_Loan: Did this customer accept the personal loan offered in the last campaign? (0: No, 1: Yes)
  • Securities_Account: Does the customer have a securities account with the bank? (0: No, 1: Yes)
  • CD_Account: Does the customer have a certificate of deposit (CD) account with the bank? (0: No, 1: Yes)
  • Online: Does the customer use internet banking facilities? (0: No, 1: Yes)
  • CreditCard: Does the customer use a credit card issued by any other bank (excluding AllLife Bank)? (0: No, 1: Yes)

Importing necessary libraries

In [ ]:
# Installing the libraries with the specified version.
!pip install numpy==1.25.2 pandas==1.5.3 matplotlib==3.7.1 seaborn==0.13.1 scikit-learn==1.2.2 sklearn-pandas==2.2.0 -q --user

Note:

  1. After running the above cell, kindly restart the notebook kernel (for Jupyter Notebook) or runtime (for Google Colab), write the relevant code for the project from the next cell, and run all cells sequentially from the next cell.

  2. On executing the above cell, you might see a warning regarding package dependencies. This warning can be safely ignored, as the code above installs all necessary libraries and their dependencies at versions that successfully execute the code in this notebook.

In [ ]:
# Libraries to help with reading and manipulating data.
import pandas as pd
import numpy as np

# Libraries to help with data visualization.
import matplotlib.pyplot as plt
import seaborn as sns

# Helpers for plot annotations and axis formatting.
import matplotlib.patches as mpatches
from matplotlib.ticker import FuncFormatter

# Removes the limit for the number of displayed columns.
pd.set_option("display.max_columns", None)
# Sets the limit for the number of displayed rows.
pd.set_option("display.max_rows", 200)

# Library to split data.
from sklearn.model_selection import train_test_split

# To build model for prediction.
from sklearn.tree import DecisionTreeClassifier
from sklearn import tree

# To tune different models.
from sklearn.model_selection import GridSearchCV

# To get different metric scores.
from sklearn.metrics import (
    f1_score,
    accuracy_score,
    recall_score,
    precision_score,
    confusion_matrix,
    make_scorer,
    classification_report,
)

# Library to suppress warnings (FutureWarning).
import warnings
warnings.filterwarnings("ignore", category=FutureWarning)

Loading the dataset

In [ ]:
# Read the Loan Modelling dataset from Google Drive.
from google.colab import drive
drive.mount('/content/drive')

# Assign the dataset to a DataFrame - personal_loan_df_original.
personal_loan_df_original = pd.read_csv('/content/drive/MyDrive/Loan_Modelling.csv')

# Create a copy of the original DataFrame to avoid modifying the original data.
personal_loan_df = personal_loan_df_original.copy()
Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).

Data Overview

  • Observations
  • Sanity checks
In [ ]:
# Return the first 5 rows of the dataset.
personal_loan_df.head(5)
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
0 1 25 1 49 91107 4 1.6 1 0 0 1 0 0 0
1 2 45 19 34 90089 3 1.5 1 0 0 1 0 0 0
2 3 39 15 11 94720 1 1.0 1 0 0 0 0 0 0
3 4 35 9 100 94112 1 2.7 2 0 0 0 0 0 0
4 5 35 8 45 91330 4 1.0 2 0 0 0 0 0 1
In [ ]:
# Return the last 5 rows of the dataset.
personal_loan_df.tail(5)
Out[ ]:
ID Age Experience Income ZIPCode Family CCAvg Education Mortgage Personal_Loan Securities_Account CD_Account Online CreditCard
4995 4996 29 3 40 92697 1 1.9 3 0 0 0 0 1 0
4996 4997 30 4 15 92037 4 0.4 1 85 0 0 0 1 0
4997 4998 63 39 24 93023 2 0.3 3 0 0 0 0 0 0
4998 4999 65 40 49 90034 3 0.5 2 0 0 0 0 1 0
4999 5000 28 4 83 92612 3 0.8 1 0 0 0 0 1 1
In [ ]:
# Display the number of rows and columns in the dataset.
rows, columns = personal_loan_df.shape

# Print the number of rows and columns from the dataset.
print(f'Number of Rows: {rows:,}')
print(f'Number of Columns: {columns:,}')
Number of Rows: 5,000
Number of Columns: 14
In [ ]:
# Remove any unnecessary columns, but only if they exist
if 'ID' in personal_loan_df.columns:
    personal_loan_df.drop(["ID"], axis=1, inplace=True)
In [ ]:
# Display summary information incl. the data types in the DataFrame.
personal_loan_df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 5000 entries, 0 to 4999
Data columns (total 13 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Age                 5000 non-null   int64  
 1   Experience          5000 non-null   int64  
 2   Income              5000 non-null   int64  
 3   ZIPCode             5000 non-null   int64  
 4   Family              5000 non-null   int64  
 5   CCAvg               5000 non-null   float64
 6   Education           5000 non-null   int64  
 7   Mortgage            5000 non-null   int64  
 8   Personal_Loan       5000 non-null   int64  
 9   Securities_Account  5000 non-null   int64  
 10  CD_Account          5000 non-null   int64  
 11  Online              5000 non-null   int64  
 12  CreditCard          5000 non-null   int64  
dtypes: float64(1), int64(12)
memory usage: 507.9 KB

Observations:

  • Data Columns:

    • The dataset contains 13 columns after dropping the ID column (the original file has 14).
  • Data Types:

    • 12 columns are of integer datatype (int64).
    • Column 6 (CCAvg) is of float datatype (float64).
  • Variable Types:

    • All 13 variables are stored as numeric types; none are stored as categorical, although several (e.g., Education, Personal_Loan) are categorical in nature.
  • Observations:

    • Each column has 5000 non-null observations, indicating a complete dataset without missing values.
  • Memory Usage:

    • The DataFrame consumes approximately 507.9 KB of memory.
In [ ]:
# Check for missing/null values in the dataset.
missing_values = personal_loan_df.isnull().sum()

# Output if there are any missing data points in the dataset.
if missing_values.sum() > 0:
    print('There are missing values in the dataset.')
else:
    print('There are no missing data points in the Personal Loan dataset.')
There are no missing data points in the Personal Loan dataset.
In [ ]:
# Display the statistical summary of the dataset.
personal_loan_df.describe(include="all").T
Out[ ]:
count mean std min 25% 50% 75% max
Age 5000.0 45.338400 11.463166 23.0 35.0 45.0 55.0 67.0
Experience 5000.0 20.104600 11.467954 -3.0 10.0 20.0 30.0 43.0
Income 5000.0 73.774200 46.033729 8.0 39.0 64.0 98.0 224.0
ZIPCode 5000.0 93169.257000 1759.455086 90005.0 91911.0 93437.0 94608.0 96651.0
Family 5000.0 2.396400 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 1.881000 0.839869 1.0 1.0 2.0 3.0 3.0
Mortgage 5000.0 56.498800 101.713802 0.0 0.0 0.0 101.0 635.0
Personal_Loan 5000.0 0.096000 0.294621 0.0 0.0 0.0 0.0 1.0
Securities_Account 5000.0 0.104400 0.305809 0.0 0.0 0.0 0.0 1.0
CD_Account 5000.0 0.060400 0.238250 0.0 0.0 0.0 0.0 1.0
Online 5000.0 0.596800 0.490589 0.0 0.0 1.0 1.0 1.0
CreditCard 5000.0 0.294000 0.455637 0.0 0.0 0.0 1.0 1.0

Observations:

  • Age:

    • The average age of customers is 45 years.
    • The age range spans from 23 to 67 years.
  • Family Size:

    • Customer family sizes vary from 1 to 4 members.
  • Income:

    • The average annual income is approximately \$73.8k.
    • The 75th percentile of income is \$98k, roughly 33% above the mean.
    • There is a wide disparity in income, with annual incomes ranging from \$8k (minimum) to \$224k (maximum).
  • Personal Loan Acceptance:

    • The majority of customers did not accept the personal loan offered in the last campaign.
  • Mortgage:

    • The mean mortgage value is \$56.5k, with a standard deviation of \$101k.
    • The standard deviation exceeding the mean indicates a high variability in mortgage amounts, which warrants further investigation.
  • Experience:

    • The minimum value for customer experience is -3, which appears to be incorrect.
    • This anomaly requires further investigation and appropriate correction.
In [ ]:
# Checking for anomalous values

personal_loan_df["Experience"].unique()
Out[ ]:
array([ 1, 19, 15,  9,  8, 13, 27, 24, 10, 39,  5, 23, 32, 41, 30, 14, 18,
       21, 28, 31, 11, 16, 20, 35,  6, 25,  7, 12, 26, 37, 17,  2, 36, 29,
        3, 22, -1, 34,  0, 38, 40, 33,  4, -2, 42, -3, 43])
In [ ]:
# Checking for experience <0
personal_loan_df[personal_loan_df["Experience"] < 0]["Experience"].unique()
Out[ ]:
array([-1, -2, -3])
In [ ]:
# Correcting the negative experience values (treated as sign-entry errors).
personal_loan_df["Experience"] = personal_loan_df["Experience"].replace({-1: 1, -2: 2, -3: 3})
In [ ]:
# checking the number of uniques in the zip code
personal_loan_df["ZIPCode"].nunique()
Out[ ]:
467

There are 467 unique ZIP codes.

In [ ]:
# Present the number of occurrences of each unique ZIP code.

zip_code_counts = personal_loan_df['ZIPCode'].value_counts()

print("Unique Zip Codes and their Counts:")
zip_code_counts
Unique Zip Codes and their Counts:
Out[ ]:
ZIPCode
94720 169
94305 127
95616 116
90095 71
93106 57
... ...
96145 1
94087 1
91024 1
93077 1
94598 1

467 rows × 1 columns


Observations:

  • Top ZIP Codes:

    • 94720: 169 occurrences (highest representation; potential key region for targeted analysis or marketing).
    • 94305: 127 occurrences.
    • 95616: 116 occurrences.
  • Top 20 ZIP Codes Overview:

    • Occurrence counts range from 169 (the highest) down to 34.
    • This indicates that several areas have a relatively high concentration of customers.
  • Geographical Concentration:

    • High frequencies in specific ZIP codes, such as 94720, 94305, and 95616, suggest that these regions have significant concentrations of individuals.
    • These clusters could be important focal points for further analysis.
  • Potential Regional Bias:

    • The data shows a bias toward certain ZIP codes, which might result from the sampling method or data collection practices.
    • This bias should be carefully considered when making generalizations or building models.
  • Diverse Representation:

    • Although there is clustering in the top ZIP codes, the spread of counts across 467 distinct ZIP codes indicates a breadth of geographical representation.
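With 467 distinct values, ZIPCode is too granular to use directly as a categorical feature. One common option (not applied in this notebook) is to coarsen it to its leading digits, which roughly correspond to geographic regions. A minimal sketch on a hypothetical sample of codes:

```python
import pandas as pd

# Hypothetical sample of ZIP codes; the real column is personal_loan_df["ZIPCode"].
zips = pd.Series([94720, 94305, 95616, 90095, 93106], name="ZIPCode")

# Keep only the first two digits as a coarse regional bucket.
zip_region = zips.astype(str).str[:2]
print(zip_region.tolist())  # ['94', '94', '95', '90', '93']
```

Grouping this way would shrink 467 levels to a handful of regional buckets, at the cost of some locality detail.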
In [ ]:
# Converting the data type of categorical features to 'category'
## we will skip the Age, Experience, CCAvg, Mortgage, Income, Family and ZIP Code columns because they will have a lot of unique values
cat_cols = [
    "Education",
    "Personal_Loan",
    "Securities_Account",
    "CD_Account",
    "Online",
    "CreditCard",
]
personal_loan_df[cat_cols] = personal_loan_df[cat_cols].astype("category")
In [ ]:
# printing the number of occurrences of each unique value in each categorical column
for column in cat_cols:
    print(personal_loan_df[column].value_counts())
    print("-" * 50)
1    2096
3    1501
2    1403
Name: Education, dtype: int64
--------------------------------------------------
0    4520
1     480
Name: Personal_Loan, dtype: int64
--------------------------------------------------
0    4478
1     522
Name: Securities_Account, dtype: int64
--------------------------------------------------
0    4698
1     302
Name: CD_Account, dtype: int64
--------------------------------------------------
1    2984
0    2016
Name: Online, dtype: int64
--------------------------------------------------
0    3530
1    1470
Name: CreditCard, dtype: int64
--------------------------------------------------
In [ ]:
# Calculate the percentage of each unique value in the categorical columns
for column in cat_cols:
    print(personal_loan_df[column].value_counts(normalize=True) * 100)
    print("-" * 50)
1    41.92
3    30.02
2    28.06
Name: Education, dtype: float64
--------------------------------------------------
0    90.4
1     9.6
Name: Personal_Loan, dtype: float64
--------------------------------------------------
0    89.56
1    10.44
Name: Securities_Account, dtype: float64
--------------------------------------------------
0    93.96
1     6.04
Name: CD_Account, dtype: float64
--------------------------------------------------
1    59.68
0    40.32
Name: Online, dtype: float64
--------------------------------------------------
0    70.6
1    29.4
Name: CreditCard, dtype: float64
--------------------------------------------------

Observations:

  • Education Level:

    • 42% of customers have an education level classified as 1, indicating an undergraduate qualification.
  • Loan Campaign Response:

    • 90% of customers did not accept the loan offered in the most recent campaign.
  • Securities Account:

    • 90% of customers do not hold a securities account with the bank.
  • Certificate of Deposit (CD) Account:

    • 94% of customers do not have a certificate of deposit (CD) account with the bank.
  • Internet Banking Usage:

    • 60% of customers use the bank's internet banking facilities.
  • Credit Card Usage (Other Banks):

    • 70% of customers do not use a credit card issued by any bank other than AllLife Bank.
In [ ]:
# Creating categories from Age, CC Avg, and Income to analyze the trend of borrowing Personal Loan

personal_loan_df["income_bin"] = pd.cut(
    x=personal_loan_df["Income"],
    bins=[0, 39, 98, 224],
    labels=["Low", "Mid", "High"],
)

personal_loan_df["cc_spending_bin"] = pd.cut(
    x=personal_loan_df["CCAvg"],
    bins=[-0.0001, 0.7, 2.5, 10.0],
    labels=["Low", "Mid", "High"],
)

personal_loan_df["age_bin"] = pd.cut(
    x=personal_loan_df["Age"],
    bins=[0, 35, 55, 67],
    labels=["Young Adults", "Middle Aged", "Senior"],
)
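After binning, it is worth confirming that every row landed in a bin, since values outside the bin edges become NaN under `pd.cut`. A small sketch with hypothetical incomes and the same edges used above:

```python
import pandas as pd

# Hypothetical incomes; the notebook applies the same edges to personal_loan_df["Income"].
income = pd.Series([8, 39, 64, 98, 224])
income_bin = pd.cut(income, bins=[0, 39, 98, 224], labels=["Low", "Mid", "High"])

# A non-zero NaN count would flag incomes outside the (0, 224] range.
print(income_bin.isna().sum())  # 0
print(income_bin.value_counts().to_dict())
```

The same check applies to the CCAvg and Age bins; note the -0.0001 lower edge on the CCAvg bins exists so that zero spending falls inside the "Low" bin.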

Exploratory Data Analysis.

  • EDA is an important part of any project involving data.
  • It is important to investigate and understand the data better before building a model with it.
  • A few questions have been mentioned below which will help you approach the analysis in the right manner and generate insights from the data.
  • A thorough analysis of the data, in addition to the questions mentioned below, should be done.

Questions:

  1. What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?
  2. How many customers have credit cards?
  3. What are the attributes that have a strong correlation with the target attribute (personal loan)?
  4. How does a customer's interest in purchasing a loan vary with their age?
  5. How does a customer's interest in purchasing a loan vary with their education?
Question 1 - What is the distribution of mortgage attribute? Are there any noticeable patterns or outliers in the distribution?

In [ ]:
# Import plotting libraries (re-imported here so the cell can run independently).
import matplotlib.pyplot as plt
import seaborn as sns

# Plot the distribution of mortgage using a histogram.
plt.figure(figsize=(12, 6))
sns.histplot(personal_loan_df['Mortgage'], bins=30, kde=True)
plt.title('Distribution of Mortgage Attribute')
plt.xlabel('Mortgage')
plt.ylabel('Frequency')
plt.grid(True)
plt.show()

# Plot the distribution of mortgage using a boxplot to identify outliers.
plt.figure(figsize=(12, 6))
sns.boxplot(x=personal_loan_df['Mortgage'])
plt.title('Boxplot of Mortgage Attribute')
plt.xlabel('Mortgage')
plt.grid(True)
plt.show()
In [ ]:
# Plot the distribution of mortgage using a violinplot to identify outliers.
plt.figure(figsize=(12, 6))
sns.violinplot(x=personal_loan_df['Mortgage'])
plt.title('Violinplot of Mortgage Attribute')
plt.xlabel('Mortgage')
plt.grid(True)
plt.show()
In [ ]:
# Plot the mortgage cumulative density distribution.
plt.figure(figsize=(10, 6))
sns.ecdfplot(personal_loan_df['Mortgage'], label='Mortgage')
plt.title('Cumulative Distribution Function of Mortgage')
plt.xlabel('Mortgage Amount')
plt.ylabel('Cumulative Density')
plt.legend()
plt.grid(True)
plt.show()
In [ ]:
# Number of customers without mortgage.
no_mortgage_count = personal_loan_df[personal_loan_df['Mortgage'] == 0].shape[0]

# Total number of customers.
total_customers = personal_loan_df.shape[0]

# Percentage of customers without mortgage.
percentage_no_mortgage = (no_mortgage_count / total_customers) * 100

print(f"Number of customers without mortgage: {no_mortgage_count:,}")
print(f"Total number of customers: {total_customers:,}")
print(f'Percentage of customers without mortgage: {percentage_no_mortgage:.2f}%')
Number of customers without mortgage: 3,462
Total number of customers: 5,000
Percentage of customers without mortgage: 69.24%

Observations:

  • Mortgage Prevalence:
    • 69.24% of customers do not have a mortgage (i.e., a mortgage value of \$0), representing a significant portion of the customer base.
  • Mortgage Distribution Characteristics:

    • Mortgage values are highly right-skewed.
    • The range of mortgage values spans from \$0k to \$635k.
    • The maximum mortgage value is substantially higher than the third quartile (Q3), indicating the presence of extreme outliers.
  • Action Items Regarding Outliers:

    • The high-side mortgage outliers warrant further verification.
    • These outliers should be validated and appropriately treated in subsequent analyses.
  • Marketing Recommendation:

    • Consider targeting the majority of customers who do not currently hold a mortgage with a special mortgage program offering low introductory rates to capture potential business.
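Before any targeting decision, a quick check is whether loan acceptance differs between customers with and without a mortgage. A minimal sketch on hypothetical rows (in the notebook, the same groupby would run on personal_loan_df):

```python
import pandas as pd

# Hypothetical rows; columns mirror Mortgage and Personal_Loan in the real data.
df = pd.DataFrame({
    "Mortgage":      [0, 0, 150, 200, 0, 90],
    "Personal_Loan": [0, 1,   1,   0, 0, 1],
})

# Acceptance rate among customers with (True) and without (False) a mortgage.
rates = df.groupby(df["Mortgage"] > 0)["Personal_Loan"].mean()
print(rates)
```

A large gap between the two rates would support segmenting campaigns by mortgage status.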

Question 2 - How many customers have credit cards?

In [ ]:
# Calculate the number of customers with credit cards from other banks
credit_card_customers = personal_loan_df[personal_loan_df['CreditCard'] == 1].shape[0]

# Calculate the total number of customers
total_customers = personal_loan_df.shape[0]

# Calculate the percentage of customers with credit cards from other banks
percentage_credit_card_customers = (credit_card_customers / total_customers) * 100

# Print the results
print(f"Number of customers with credit cards from other banks: {credit_card_customers}")
print(f"Total number of customers: {total_customers}")
print(f"Percentage of customers with credit cards from other banks: {percentage_credit_card_customers:.2f}%")
Number of customers with credit cards from other banks: 1470
Total number of customers: 5000
Percentage of customers with credit cards from other banks: 29.40%
In [ ]:
# Count customers with and without credit cards from other banks.
credit_card_counts = personal_loan_df['CreditCard'].value_counts()

# Create the pie chart
plt.figure(figsize=(8, 8))  # Adjust figure size as needed
plt.pie(credit_card_counts, labels=['No Credit Card', 'Has Credit Card'], autopct='%1.1f%%', startangle=90)
plt.title('Distribution of Credit Card Usage')
plt.axis('equal')  # Equal aspect ratio ensures that pie is drawn as a circle
plt.show()

Observations

  • Credit Card Ownership:

    • 29.4% of customers have credit cards from other banks.
  • Marketing Recommendation:

    • Launch a special program offering low introductory APR to capture business from these customers and divert them from competitors.

Question 3 - What are the attributes that have a strong correlation with the target attribute (personal loan)?

In [ ]:
# Calculate the correlation matrix, encoding every column (including categoricals) as integer codes.
correlation_matrix = personal_loan_df.apply(lambda x: pd.factorize(x)[0]).corr()

# Filter for correlations with 'Personal_Loan'
personal_loan_correlations = correlation_matrix['Personal_Loan'].drop('Personal_Loan')

# Print the correlations
print("Correlation with Personal Loan:")
print(personal_loan_correlations)

# Set the figure size
plt.figure(figsize=(10, 6))

# Create the heatmap
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm', fmt=".2f")
plt.title('Correlation Matrix Heatmap')
plt.show()

# Identify attributes with strong positive correlations
positive_correlations = personal_loan_correlations[personal_loan_correlations > 0.1]
print("\nAttributes with strong positive correlation with Personal Loan:")
print(positive_correlations)
Correlation with Personal Loan:
Age                   0.010024
Experience            0.001934
Income                0.209916
ZIPCode              -0.005038
Family               -0.053281
CCAvg                 0.286473
Education             0.136722
Mortgage              0.090520
Securities_Account   -0.021954
CD_Account            0.316355
Online                0.006278
CreditCard            0.002802
income_bin            0.430311
cc_spending_bin       0.044309
age_bin              -0.013886
Name: Personal_Loan, dtype: float64
Attributes with strong positive correlation with Personal Loan:
Income        0.209916
CCAvg         0.286473
Education     0.136722
CD_Account    0.316355
income_bin    0.430311
Name: Personal_Loan, dtype: float64

Observations

  • Strongest Correlations with Personal Loan:
    • Income: Higher income levels may increase the likelihood of loan acceptance, as customers with greater financial stability are more inclined to take on loans.
    • CCAvg (Credit Card Average Spending): Customers with higher credit card spending may demonstrate financial activity that aligns with loan eligibility or demand.
    • Education: Higher education levels could be associated with greater financial literacy or income potential, influencing loan acceptance decisions.
    • CD_Account (Certificate of Deposit Account): Owning a CD account may indicate financial planning habits, which could correlate with a willingness to accept a loan.
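A caveat on the factorize-based matrix above: `pd.factorize` assigns integer codes in order of first appearance, not by value, so Pearson correlations computed on those codes can be distorted for ordered variables. A small illustrative check (not part of the notebook's pipeline) shows the scrambling and the safer alternative of correlating the raw numeric values:

```python
import pandas as pd

# Hypothetical ordered feature whose first-seen order differs from its value order.
s = pd.Series([3, 1, 2, 3, 1])

codes = pd.factorize(s)[0]
print(codes.tolist())  # [0, 1, 2, 0, 1] -- 3->0, 1->1, 2->2: the ordering is lost

# Correlating the raw numeric values preserves the ordering.
target = pd.Series([1, 0, 0, 1, 0])
print(round(s.corr(target), 3))  # 0.913
```

For binary flags like CD_Account the factorize codes are just a 0/1 relabeling, so those correlations are unaffected up to sign; the caveat mainly concerns ordered columns such as Education and the bins created earlier.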

Question 4 - How does a customer's interest in purchasing a loan vary with their age?

In [ ]:
# Create age groups
bins = [20, 30, 40, 50, 60, 70, 80]  # Define bin edges for age groups
labels = ['20-29', '30-39', '40-49', '50-59', '60-69', '70+']
personal_loan_df['Age_Group'] = pd.cut(personal_loan_df['Age'], bins=bins, labels=labels, right=False)

# Group by age group and calculate the proportion of customers who accepted a loan.
# (.sum() / len() is used because Personal_Loan was cast to 'category', which does not support .mean().)
age_group_loan_interest = personal_loan_df.groupby('Age_Group')['Personal_Loan'].apply(lambda x: (x == 1).sum() / len(x))

# Create the bar chart
plt.figure(figsize=(12, 6))
plt.bar(age_group_loan_interest.index, age_group_loan_interest.values)
plt.xlabel('Age Group')
plt.ylabel('Proportion of Customers Interested in Loan')
plt.title('Interest in Loan by Age Group')
plt.grid(True)
plt.show()
<ipython-input-27-e959d9a5f260>:8: RuntimeWarning: invalid value encountered in scalar divide
  age_group_loan_interest = personal_loan_df.groupby('Age_Group')['Personal_Loan'].apply(lambda x: (x == 1).sum() / len(x))

Observations:

  • Loan Interest Across Age Groups:

    • There is no significant difference in personal loan acceptance across age groups.
    • Acceptance proportions remain relatively stable across all groups, with minor fluctuations.
  • Breakdown of Loan Interest by Age:

    • 20-29 years: ~10.04%
    • 30-39 years: ~10.18%
    • 40-49 years: ~9.31%
    • 50-59 years: ~8.85%
    • 60-69 years: ~10.24%
  • Key Observations:

    • The highest interest is observed in the 30-39 and 60-69 age groups.
    • The lowest loan interest is in the 50-59 age group (~8.85%).
    • The variations in acceptance across age groups are minimal, suggesting that age may not be a strong determinant of loan interest.
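The RuntimeWarning emitted above comes from the empty '70+' bin: the maximum age in the data is 67, so that group contributes a 0/0 division. Passing `observed=True` to groupby skips empty categories. A minimal sketch with hypothetical ages:

```python
import pandas as pd

# Hypothetical ages; no value falls into the '70+' bin, mirroring the real data.
ages = pd.Series([25, 34, 45, 62])
loans = pd.Series([0, 1, 0, 1])
groups = pd.cut(ages, bins=[20, 30, 40, 50, 60, 70, 80],
                labels=['20-29', '30-39', '40-49', '50-59', '60-69', '70+'],
                right=False)

# observed=True drops empty categories, avoiding the 0/0 division warning.
rates = loans.groupby(groups, observed=True).mean()
print('70+' in rates.index)  # False
```

This also explains why the breakdown above lists only five age groups.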

Question 5 - How does a customer's interest in purchasing a loan vary with their education?

In [ ]:
# Calculate the proportion of customers who accepted a loan at each education level.
loan_interest_by_education = (
    personal_loan_df.groupby('Education')['Personal_Loan'].apply(lambda x: (x == 1).sum() / len(x))
)

# Print proportion of customers interested in loans by education level.
# 1: Undergrad; 2: Graduate; 3: Advanced/Professional
print("Loan Interest by Education Level")
print("-" * 60)
for education_level, proportion in loan_interest_by_education.items():
    print(f"Education Level: {education_level}, Proportion: {proportion}")

# Visualize results.
loan_interest_by_education.plot(kind='bar', color='green')
plt.xlabel('Education Level')
plt.ylabel('Proportion of Customers Interested in Loans')
plt.title('Customer Interest in Purchasing Loans by Education Level')
plt.show()
Loan Interest by Education Level
------------------------------------------------------------
Education Level: 1, Proportion: 0.044370229007633585
Education Level: 2, Proportion: 0.12972202423378476
Education Level: 3, Proportion: 0.13657561625582945

Observations:

  • Loan Interest Among Education Groups:

    • Undergraduates: Only 4.44% show interest in personal loans, likely due to student debt burdens and aversion to high APR credit products.
    • Graduates: 12.97% show interest, possibly linked to greater financial stability and a broader understanding of money management.
    • Advanced/Professionals: The highest acceptance rate, 13.66%, which could be attributed to higher incomes and more established financial planning.
  • Recommendation:

    • Introduce a special loan program tailored for undergraduates, featuring low APR or no interest penalty for a limited time.
    • Target graduates and professionals with personalized loan offers aligned with their income stability and financial habits.

Additional Exploratory Data Analysis

Univariate Analysis

The first step of univariate analysis is to check the distribution/spread of the data, primarily using histograms and box plots. In addition, we'll plot each numerical feature on a violin plot and a cumulative density distribution plot. The summary() function below produces these four plots for a given numerical attribute and also displays its feature-wise five-point summary.

In [ ]:
!pip install tabulate -q --user
# Import the tabulate function from the tabulate library.
from tabulate import tabulate

def summary(x):
    '''
    The function prints the 5 point summary and histogram, box plot,
    violin plot, and cumulative density distribution plots for each
    feature name passed as the argument.

    Parameters:
    ----------

    x: str, feature name

    Usage:
    ------------

    summary('Age')
    '''
    x_min = personal_loan_df[x].min()
    x_max = personal_loan_df[x].max()
    Q1 = personal_loan_df[x].quantile(0.25)
    Q2 = personal_loan_df[x].quantile(0.50)
    Q3 = personal_loan_df[x].quantile(0.75)

    summary_stats = {'Min': x_min, 'Q1': Q1, 'Q2': Q2, 'Q3': Q3, 'Max': x_max}
    df = pd.DataFrame(summary_stats, index=['Value'])
    print(f'5 Point Summary of {x.capitalize()} Attribute:\n')
    print(tabulate(df, headers = 'keys', tablefmt = 'psql'))

    fig = plt.figure(figsize=(16, 8))
    plt.subplots_adjust(hspace = 0.6)
    sns.set_palette('Pastel1')

    plt.subplot(221, frameon=True)
    ax1 = sns.histplot(personal_loan_df[x], color = 'purple')
    ax1.axvline(
        np.mean(personal_loan_df[x]), color="purple", linestyle="--"
    )  # Add mean to the histogram
    ax1.axvline(
        np.median(personal_loan_df[x]), color="black", linestyle="-"
    )  # Add median to the histogram
    plt.title(f'{x} Density Distribution')

    plt.subplot(222, frameon=True)
    ax2 = sns.violinplot(x = personal_loan_df[x], palette = 'Accent')
    plt.title(f'{x.capitalize()} Violinplot')

    plt.subplot(223, frameon=True, sharex=ax1)
    ax3 = sns.boxplot(x=personal_loan_df[x], palette = 'cool', width=0.7, linewidth=0.6, showmeans=True)
    plt.title(f'{x.capitalize()} Boxplot')

    plt.subplot(224, frameon=True, sharex=ax2)
    ax4 = sns.kdeplot(personal_loan_df[x], cumulative=True)
    plt.title(f'{x.capitalize()} Cumulative Density Distribution')

    plt.show()

Observation on Age

In [ ]:
summary('Age')
5 Point Summary of Age Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |    23 |   35 |   45 |   55 |    67 |
+-------+-------+------+------+------+-------+

Observations:

  • Age Distribution:

    • The dataset shows a well-distributed age range but has five noticeable spikes, indicating concentrated age groups.
  • Key Age Statistics:

    • Minimum Age: 23 years
    • Maximum Age: 67 years
    • Mean Age: ~45 years
    • Median Age: ~45 years

Observation on Experience

In [ ]:
summary('Experience')
5 Point Summary of Experience Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |   10 |   20 |   30 |    43 |
+-------+-------+------+------+------+-------+

Observations:

  • Experience Distribution:

    • The dataset shows a well-distributed range but exhibits four noticeable spikes, indicating concentrated groups.
  • Key Experience Statistics:

    • Minimum Experience: 0 years
    • Maximum Experience: 43 years
    • Mean Experience: ~20 years
    • Median Experience: ~20 years
  • Potential Correlation with Age:

    • The distribution of Experience appears suspiciously similar to Age, suggesting a possible correlation.
    • Further analysis using pairplots and heatmaps will help validate and quantify this relationship.
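The suspected link can also be quantified directly with Pearson correlation before plotting; in the notebook the equivalent call would be `personal_loan_df['Age'].corr(personal_loan_df['Experience'])`. A minimal sketch on hypothetical values:

```python
import pandas as pd

# Hypothetical age/experience pairs where Experience tracks Age minus a rough offset.
age = pd.Series([23, 30, 45, 55, 67])
experience = pd.Series([0, 5, 20, 30, 43])

corr = age.corr(experience)
print(round(corr, 3))  # 0.999 -- near-perfect linear relationship
```

A correlation this close to 1 would suggest dropping one of the two columns before modeling to avoid redundancy.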

Observations on Income

In [ ]:
summary('Income')
5 Point Summary of Income Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     8 |   39 |   64 |   98 |   224 |
+-------+-------+------+------+------+-------+

Observations

  • Income Distribution:

    • Income is right-skewed, indicating that most customers have lower incomes, with a few high-income individuals pulling the distribution's tail.
  • Income Range:

    • The dataset spans from \$8k to \$224k, showing a wide disparity in earnings.
  • High-End Outliers:

    • The maximum income is significantly higher than Q3, requiring validation to confirm its accuracy.
    • There are outliers on the higher side, which need further investigation and possible treatment if required.
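One standard way to flag those high-side values is the 1.5×IQR whisker rule used by boxplots. A sketch, with hypothetical income values (in \$k) standing in for `personal_loan_df['Income']`:

```python
import pandas as pd

def iqr_outliers(s: pd.Series) -> pd.Series:
    """Return the values lying outside the 1.5 * IQR whiskers."""
    q1, q3 = s.quantile(0.25), s.quantile(0.75)
    iqr = q3 - q1
    lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    return s[(s < lower) | (s > upper)]

# Hypothetical income values (in $k), including one extreme point
income = pd.Series([8, 39, 64, 98, 110, 224])
print(iqr_outliers(income).tolist())
```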

Observations on CCAvg

In [ ]:
summary('CCAvg')
5 Point Summary of Ccavg Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |  0.7 |  1.5 |  2.5 |    10 |
+-------+-------+------+------+------+-------+

Observations

  • CC Avg Distribution:

    • The dataset shows a right-skewed distribution, indicating that most customers have lower credit card spending, while a few have significantly higher averages.
  • Range:

    • CC Avg spans from \$0k to \$10k, highlighting a broad spread in spending behavior.
  • High-End Outliers:

    • The maximum CC Avg is considerably higher than Q3, warranting validation to ensure data accuracy.
    • Numerous outliers on the higher side should be investigated and potentially treated based on further analysis.

Observations on Mortgage

In [ ]:
summary('Mortgage')
5 Point Summary of Mortgage Attribute:

+-------+-------+------+------+------+-------+
|       |   Min |   Q1 |   Q2 |   Q3 |   Max |
|-------+-------+------+------+------+-------|
| Value |     0 |    0 |    0 |  101 |   635 |
+-------+-------+------+------+------+-------+

Observations

  • Mortgage Distribution:

    • The dataset shows a highly right-skewed mortgage distribution, with a concentration of lower values and a few extreme high values.
  • Range:

    • Mortgage values span from \$0k to \$635k, indicating significant variability.
  • High-End Outliers:

    • The maximum mortgage value is far above Q3, confirming the presence of extreme outliers.
    • Numerous high-value outliers require validation to ensure data integrity and correct any potential anomalies.
  • Outlier Treatment:

    • The extreme values will be analyzed and treated based on statistical methods such as Winsorization, log transformation, or removal if necessary.
    • Further investigation is needed to determine whether these high values are genuine or due to data entry inconsistencies.
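The Winsorization and log-transformation options mentioned above can be prototyped in a few lines. A sketch on hypothetical Mortgage-like values; `np.log1p` (log of 1 + x) is used because a plain log is undefined at the zero values that dominate this column:

```python
import numpy as np
import pandas as pd

def treat_skew(s: pd.Series, cap_pct: float = 0.95) -> pd.DataFrame:
    """Return winsorized (capped at cap_pct) and log1p versions of a series."""
    capped = s.clip(upper=s.quantile(cap_pct))  # winsorize the right tail
    logged = np.log1p(s)                        # log(1 + x), safe at x == 0
    return pd.DataFrame({'original': s, 'capped': capped, 'log1p': logged})

mortgage = pd.Series([0, 0, 0, 101, 635])      # hypothetical values (in $k)
treated = treat_skew(mortgage)
print(treated)
```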

Percentage on Bar Chart for Categorical Features

Categorical variables are most effectively visualized as bar charts representing percentage of total, ensuring clearer insights into distribution patterns.


In [ ]:
def perc_on_bar(cat_columns):
    '''
    The function takes a list of categorical columns as input and plots a bar
    chart for each, with percentages and counts on top of every bar.

    Usage:
    ------

    perc_on_bar(['Education', 'Family'])
    '''
    num_cols = len(cat_columns)
    # Calculate the number of rows needed
    num_rows = (num_cols + 1) // 2  # Add 1 to ensure enough rows for odd numbers of columns

    plt.figure(figsize=(16, 14))
    for i, col in enumerate(cat_columns):
        plt.subplot(num_rows, 2, i + 1)  # Use calculated num_rows and 2 columns
        order = personal_loan_df[col].value_counts(ascending=False).index
        ax = sns.countplot(data=personal_loan_df, x=col, palette='crest', order=order)
        for p in ax.patches:
            # Percentage of all rows plus the raw count on each bar
            percentage = '{:.1f}%\n({})'.format(100 * p.get_height() / len(personal_loan_df), p.get_height())
            x = p.get_x() + p.get_width() / 2
            y = p.get_y() + p.get_height() + 40
            plt.annotate(percentage, (x, y), ha='center', color='black', fontsize='medium')  # Annotation on top of bars
        # Per-subplot formatting belongs outside the patch loop
        plt.xticks(color='black', fontsize='medium', rotation=(-90 if col == 'region' else 0))
        plt.title(col.capitalize() + ' Percentage Bar Charts\n\n')
    plt.tight_layout()
In [ ]:
cat_columns = personal_loan_df.select_dtypes(exclude=[np.int64, np.float64]).columns.unique().tolist() # Only categorical columns

# Check if elements exist before removing them
if 'ZIPCode' in cat_columns:
    cat_columns.remove('ZIPCode')
if 'Personal_Loan' in cat_columns:
    cat_columns.remove('Personal_Loan')
if 'age_bin' in cat_columns:
    cat_columns.remove('age_bin')
if 'income_bin' in cat_columns:
    cat_columns.remove('income_bin')
if 'cc_spending_bin' in cat_columns:
    cat_columns.remove('cc_spending_bin')
if  'Age_Group' in cat_columns:
    cat_columns.remove('Age_Group')
if 'Family' in cat_columns:
    cat_columns.remove('Family')

perc_on_bar(cat_columns)

Observations

  • Education Level:

    • 42% of customers are Undergraduates, making up a significant portion of the dataset.
  • Financial Product Ownership:

    • 89.6% of customers do not have a Securities Account.
    • 94% of customers do not hold Certificates of Deposit (CDs).
  • Banking Preferences:

    • 59.7% of customers have Online Banking enabled, indicating a strong digital adoption rate.
    • 70.6% of customers do not have a credit card from other banks, suggesting a preference for exclusive banking relationships.
  • Personal Loan Uptake:

    • Only 9.6% of customers have taken a personal loan from the bank.
  • Key Implication:

    • The low adoption rate suggests that the majority of customers either do not need personal loans, find the terms unfavorable, or prefer other financing options.
  • Potential Actions:

    • Investigate the reasons behind the low acceptance rate (e.g., high interest rates, creditworthiness requirements, or alternative borrowing preferences). Optimize marketing strategies to increase engagement, possibly through tailored loan offerings or educational initiatives that inform customers about the benefits.
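The uptake percentages quoted above come straight from a normalized value count. A sketch; the series below is hypothetical stand-in data for `personal_loan_df['Personal_Loan']` (0: No, 1: Yes):

```python
import pandas as pd

def class_balance(s: pd.Series) -> pd.Series:
    """Return the percentage share of each category, sorted descending."""
    return (s.value_counts(normalize=True) * 100).round(1)

# Hypothetical stand-in: 904 non-borrowers and 96 borrowers
loans = pd.Series([0] * 904 + [1] * 96)
print(class_balance(loans))
```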
In [ ]:
cat_columns = ['income_bin', 'cc_spending_bin', 'age_bin']

perc_on_bar(cat_columns)

Observations

  • Age Distribution:

    • 50.2% of customers fall into the Middle-Aged category (35-55 years).
  • Income Levels:

    • 48.8% of customers are Mid-Level Earners, with annual incomes ranging from \$39K to \$98K.
  • Credit Card Spending Patterns:

    • 47.4% of customers spend between \$0.7K and \$2.5K on credit cards, indicating moderate usage.

Bivariate Analysis

Bivariate analysis focuses on identifying relationships between two variables to determine patterns, dependencies, or correlations. It plays a crucial role in understanding how one variable may influence another.

Personal Loans vs. All Numerical Columns

In [ ]:
# The code below plots box plots for each numerical feature grouped by Personal Loan status (0: Not Borrowed, 1: Borrowed). It helps visualize distributions, identify trends, and detect outliers in loan acceptance analysis.

plt.style.use('ggplot') # Setting plot style
numeric_columns = personal_loan_df.select_dtypes(include=np.number).columns.unique().tolist() # Only numerical columns
# Check if 'ZIPCode' is in the list before attempting to remove it
if 'ZIPCode' in numeric_columns:
    numeric_columns.remove('ZIPCode') # Excluding zip code, as there are too many, and it won't make sense
# Add 'Family' to the list of numerical columns (it's treated as numerical here for the boxplot)
numeric_columns.append('Family')
plt.figure(figsize=(20,30))

for i, col in enumerate(numeric_columns):
    plt.subplot(8,2,i+1)
    sns.boxplot(data=personal_loan_df, x='Personal_Loan', y=col, orient='vertical', palette="Blues")
    plt.xticks(ticks=[0,1], labels=['No (0)', 'Yes (1)'])
    plt.tight_layout()
    plt.title(str(i+1)+ ': Personal Loan vs. ' + col, color='black')

Observations

  • No Clear Clustering Pattern:

    • No distinct grouping is observed between Personal Loan Opted vs. Age and Experience, suggesting that loan adoption is not significantly impacted by these factors.
  • Income Influence:

    • Customers with higher income are more likely to take personal loans, possibly due to greater financial confidence and repayment ability.
  • Family Size Impact:

    • Individuals with 2–4 family members show a greater tendency to opt for personal loans, potentially due to increased household expenses or financial commitments.
  • Mortgage Influence:

    • Customers with high mortgage amounts are more likely to take personal loans, possibly to supplement housing-related financial needs.
  • Credit Card Spending Correlation:

    • Customers with a higher credit card average spending tend to opt for personal loans, suggesting a higher overall financial engagement or borrowing behavior.

Personal Loan vs. Education

In [ ]:
# This function takes a categorical column as input and visualizes percentage distributions using bar charts, pie charts and stacked charts.

def cat_view(x):
    """
    Function to create a Bar chart and a Pie chart for categorical variables.
    """
    from matplotlib import cm
    color1 = cm.inferno(np.linspace(.4, .8, 30))
    color2 = cm.viridis(np.linspace(.4, .8, 30))
    sns.set_palette('cubehelix')
    fig, ax = plt.subplots(1, 2, figsize=(16, 4))


    """
    Draw a Pie Chart on first subplot.
    """
    s = personal_loan_df.groupby(x).size()

    mydata_values = s.values.tolist()
    mydata_index = s.index.tolist()
    def func(pct, allvals):
        absolute = int(pct/100.*np.sum(allvals))
        return "{:.1f}%\n({:d})".format(pct, absolute)


    wedges, texts, autotexts = ax[0].pie(mydata_values, autopct=lambda pct: func(pct, mydata_values),
                                      textprops=dict(color="w"))

    ax[0].legend(wedges, mydata_index,
              title=x.capitalize(),
              loc="center left",
              bbox_to_anchor=(1, 0, 0.5, 1))

    plt.setp(autotexts, size=12)

    ax[0].set_title(f'{x.capitalize()} Pie Chart')

    """
    Draw a Bar Graph on second subplot.
    """

    # Changed 'income' to 'Income' to match the actual column name in personal_loan_df
    df = pd.pivot_table(personal_loan_df, index = [x], columns = ['Personal_Loan'], values = ['Income'], aggfunc = len)


    labels = df.index.tolist()
    loan_no = df.values[:, 0].tolist()
    loan_yes = df.values[:, 1].tolist()

    l = np.arange(len(labels))  # the label locations
    width = 0.35  # the width of the bars

    rects1 = ax[1].bar(l - width/2, loan_no, width, label='No Loan', color = color1)
    rects2 = ax[1].bar(l + width/2, loan_yes, width, label='Loan', color = color2)

    # Add some text for labels, title and custom x-axis tick labels, etc.
    ax[1].set_ylabel('Scores')
    ax[1].set_title(f'{x.capitalize()} Bar Graph')
    ax[1].set_xticks(l)
    ax[1].set_xticklabels(labels)
    ax[1].legend()

    def autolabel(rects):

        """Attach a text label above each bar in *rects*, displaying its height."""

        for rect in rects:
            height = rect.get_height()
            ax[1].annotate('{}'.format(height),
                        xy=(rect.get_x() + rect.get_width() / 2, height),
                        xytext=(0, 3),  # 3 points vertical offset
                        textcoords="offset points",
                        fontsize = 'medium',
                        ha='center', va='bottom')


    autolabel(rects1)
    autolabel(rects2)

    fig.tight_layout()
    plt.show()

    """
    Draw a Stacked Bar Graph on bottom.
    """

    sns.set(palette="tab10")
    # Changed 'personal_loan' and 'data' to 'Personal_Loan' and 'personal_loan_df' respectively.
    tab = pd.crosstab(personal_loan_df[x], personal_loan_df['Personal_Loan'].map({0:'No Loan', 1:'Loan'}), normalize="index")

    tab.plot.bar(stacked=True, figsize=(16, 3))
    plt.title(x.capitalize() + ' Stacked Bar Plot')
    plt.legend(loc="upper right", bbox_to_anchor=(0,1))
    plt.show()
In [ ]:
cat_view('Education')

Observations

  • Loan Uptake by Education Level:

    • Customers with an Advanced/Professional education level have taken personal loans at a higher rate compared to Graduates and Undergraduates.
  • Possible Explanation:

    • Advanced/Professional individuals may have higher financial stability, making them more comfortable with borrowing.
    • They might also have higher income levels or better creditworthiness, increasing their chances of loan approval.
  • Potential Action:

    • Tailor loan offers to professionals, emphasizing competitive interest rates and financial flexibility.
    • Consider targeted marketing strategies for graduates and undergraduates to improve loan adoption.

Personal Loan vs. Family

In [ ]:
cat_view('Family')

Observations

  • Loan Uptake by Family Size:

    • Customers with family sizes of 3 or more are more likely to take personal loans compared to those with smaller families.
  • Possible Explanation:

    • Larger families often have higher household expenses, prompting a greater need for financial support.
    • Increased financial responsibilities (education, housing, daily living costs) may drive loan demand.
  • Potential Action:

    • Consider targeted loan offerings for families, highlighting flexible repayment plans and competitive rates.
    • Explore bundling financial products that cater specifically to larger households, such as family-focused insurance or savings programs.

Personal Loan vs. Securities Account

In [ ]:
cat_view('Securities_Account')

Observations

  • Loan vs. Securities Account Relationship:

    • 60 customers who took a Personal Loan also held a Securities Account.
    • However, 420 customers with Securities Accounts did not borrow a Personal Loan.
  • Possible Interpretation:

    • Customers with Securities Accounts may have alternative financial assets, reducing their need for loans.
    • Those who did take a loan despite having a Securities Account might be leveraging their assets for investment purposes or liquidity needs.
  • Potential Action:

    • Investigate whether Securities Account holders represent a viable segment for targeted loan offerings.
    • Consider educational outreach on how personal loans can complement investment strategies.

Personal Loan vs. Online Banking

In [ ]:
cat_view('Online')

Observations

  • Impact of Online Banking on Loan Uptake:

    • Whether a customer uses Online Banking or not does not significantly influence their decision to take a Personal Loan.
  • Possible Explanation:

    • Loan adoption is likely driven by financial factors such as income, credit history, and spending habits, rather than digital banking usage.
    • Customers who use online banking may already have access to alternative financial tools, reducing their need for loans.
  • Potential Action:

    • Focus marketing efforts on income-based segmentation rather than targeting online banking users.
    • Investigate deeper customer behavioral trends that might offer stronger predictive indicators for loan adoption.

Personal Loan vs. CD Account

In [ ]:
cat_view('CD_Account')

Observations

  • Certificates of Deposit & Loan Uptake:

    • Customers with Certificates of Deposit (CDs) are more likely to take Personal Loans: almost 50% of CD holders have borrowed.
    • This suggests that customers with CD accounts might be financially engaged, using personal loans for strategic financial management.
  • Majority Behavior:

    • Out of 5000 customers, 4358 (87.2%) do not have a Certificate of Deposit account and did not borrow a Personal Loan.
    • This indicates a strong correlation between CD ownership and loan adoption, but also highlights that most customers do not use CDs or personal loans.
  • Potential Action:

    • Investigate whether CD holders are using personal loans for investment or liquidity purposes.
    • Consider targeted loan offers for CD owners, emphasizing benefits of leveraging their deposits for financial flexibility.
    • Explore educational campaigns for non-CD holders on structured financial products and their advantages.
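The CD figures above reduce to a conditional loan rate, which a single groupby yields directly. A sketch on hypothetical stand-in data (assumes the target is numeric 0/1, as in the notebook before the categorical cast):

```python
import pandas as pd

def loan_rate_by(df: pd.DataFrame, group_col: str,
                 target: str = 'Personal_Loan') -> pd.Series:
    """Return the share of loan takers within each level of group_col."""
    return df.groupby(group_col)[target].mean().round(3)

# Hypothetical stand-in for the notebook's DataFrame:
# 90 non-CD customers (5 borrowers), 10 CD customers (5 borrowers)
demo = pd.DataFrame({
    'CD_Account':    [0] * 90 + [1] * 10,
    'Personal_Loan': [0] * 85 + [1] * 5 + [0] * 5 + [1] * 5,
})
print(loan_rate_by(demo, 'CD_Account'))
```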

Personal Loan vs. Credit Card

In [ ]:
cat_view('CreditCard')

Observations

  • Loan Adoption vs. External Credit Cards:

    • Customers who borrowed a Personal Loan mostly did not use credit cards from other banks.
  • Possible Interpretation:

    • These customers may prefer centralized financial relationships, keeping all banking products within the same institution.
    • They might also have stronger financial ties with the bank, receiving better loan offers or bundled incentives.
  • Potential Action:

    • Banks could introduce cross-selling opportunities, offering credit card promotions to loan customers.
    • Investigate whether these customers have higher loyalty rates, possibly influencing retention strategies.

Personal Loans vs. Age

In [ ]:
cat_view('age_bin')

Observations

  • Loan Adoption Among Middle-Aged Customers:

    • The majority of customers who borrowed a Personal Loan fall within the 35-55 age range, indicating strong loan engagement from middle-aged individuals.
  • Possible Explanation:

    • Middle-aged customers may have higher financial stability, making them more comfortable taking loans.
    • This age group often faces major financial commitments, such as mortgages, family expenses, or investment opportunities, increasing their need for loans.
  • Potential Action:

    • Banks could offer tailored loan packages for middle-aged borrowers, such as flexible repayment plans or low-interest refinancing options.
    • Investigate if younger or older age groups have different loan behavior trends, refining loan product segmentation accordingly.

Personal Loan vs. Zip Code

In [ ]:
# Bivariate relationship between Personal Loan (no loan vs. loan) and the top 10 ZIP codes, shown as a bar chart

import matplotlib.pyplot as plt
import seaborn as sns

# Assuming 'personal_loan_df' is your DataFrame

# Get the top 10 zip codes
top_10_zipcodes = personal_loan_df['ZIPCode'].value_counts().nlargest(10).index

# Filter the DataFrame to include only the top 10 zip codes
top_zip_df = personal_loan_df[personal_loan_df['ZIPCode'].isin(top_10_zipcodes)]

plt.figure(figsize=(12, 6))
sns.countplot(data=top_zip_df, x='ZIPCode', hue='Personal_Loan')
plt.title('Personal Loan vs. Top 10 Zip Codes')
plt.xlabel('Zip Code')
plt.ylabel('Count')
plt.xticks(rotation=45)
plt.legend(title='Personal Loan')
plt.tight_layout()
plt.show()

Personal Loan vs. Income

In [ ]:
cat_view('income_bin')

Observations

  • Loan Adoption Among High-Income Customers:

    • The majority of personal loan borrowers have high incomes within the \$98K–\$224K range, suggesting a strong correlation between earnings and loan uptake.
  • Possible Explanation:

    • High-income individuals might have greater financial leverage, enabling them to secure better loan terms with lower interest rates.
    • They may also use personal loans strategically, such as for investment opportunities or tax optimization purposes rather than financial necessity.
  • Potential Action:

    • Banks could introduce premium loan products tailored for high-income earners, featuring exclusive benefits like lower interest rates, personalized repayment plans, or bundled financial services.
    • Investigate whether income brackets impact loan repayment behavior, refining risk assessment models for more effective loan offerings.

Personal Loan vs. CC Average

In [ ]:
cat_view('cc_spending_bin')

Observations

  • Loan Adoption Among High-Spending Customers:

    • Customers with high expenditures in the \$2.5K–\$10K range are more likely to have taken a Personal Loan, suggesting a correlation between spending behavior and loan uptake.
  • Possible Explanation:

    • High-spending individuals may have greater financial needs, leading them to seek loans for liquidity or lifestyle maintenance.
    • They might also use loans strategically to manage cash flow, handle large purchases, or consolidate debt.
  • Potential Action:

    • Banks could offer personalized loan products for high-spending customers, with features like flexible repayment plans or credit-linked benefits.
    • Investigate whether spending categories (e.g., travel, luxury goods, or business expenses) influence loan adoption, refining financial insights further.

Multivariate Analysis

Education Level vs. Income by Personal Loan

In [ ]:
# Below code shows swarm plot for customers by Income and Education level, segregated by Personal Loan opted or not

sns.set(palette='icefire')
plt.figure(figsize=(15,5))
# Changed 'education', 'income' and 'personal_loan' to 'Education', 'Income', and 'Personal_Loan' to match column names.
sns.swarmplot(data=personal_loan_df, x='Education', y='Income', hue='Personal_Loan').set(title='Swarmplot: Education vs Income by Personal Loan\n0: Not Borrowed, 1: Borrowed');
plt.legend(loc="upper left" ,title="Opted Personal Loan", bbox_to_anchor=(1,1));

Observations

  • Loan Uptake Among Educated & High-Income Customers:

    • Customers with higher education levels and higher income brackets show a stronger tendency to borrow personal loans.

Age vs. Mortgage Value by Personal Loan

Observations

Loan Uptake Among High-Mortgage Customers:

  • Customers with higher mortgage amounts tend to opt for personal loans more frequently.

Possible Explanation

  • Large mortgage obligations may drive a need for additional liquidity, leading individuals to seek personal loans for financial flexibility.
  • High-mortgage holders might use personal loans for home improvements, debt consolidation, or temporary cash flow management.

Income vs. Mortgage Value by Personal Loan

In [ ]:
sns.set_palette('tab10')

# Changed 'income', 'mortgage', and 'personal_loan' to 'Income', 'Mortgage', and 'Personal_Loan' respectively.
sns.jointplot(data=personal_loan_df, x='Income', y='Mortgage', \
              hue='Personal_Loan');

Income vs. CCAverage by Personal Loan

In [ ]:
sns.set_palette('tab10')

# Changed 'income', 'ccavg', and 'personal_loan' to 'Income', 'CCAvg', and 'Personal_Loan' respectively.
sns.jointplot(data=personal_loan_df, x='Income', y='CCAvg', \
              hue='Personal_Loan');

Observations

  • Financial Profile & Loan Uptake:

    • High-profile customers (higher income, mortgage, and credit card expenditure) tend to borrow personal loans more frequently.
    • Low-profile customers (low income, mortgage, and spending) rarely borrow personal loans.
    • Mid-profile customers (moderate values across these features) have a mixed tendency, meaning loan uptake varies within this group.
  • Significance of These Features:

    • Income, mortgage, and spending behavior serve as strong indicators for categorizing customers into High, Mid, and Low financial profiles.
    • These insights help refine risk assessment, loan marketing strategies, and personalized financial offerings.

Pairplot of all available numeric columns, hued by Personal Loan

In [ ]:
# Below plot shows correlations between the numerical features in the dataset

plt.figure(figsize=(20,20));
sns.set(palette="nipy_spectral");
sns.pairplot(data=personal_loan_df, hue='Personal_Loan', corner=True); # Changed 'personal_loan' to 'Personal_Loan'

Heatmap to visualize and analyse correlations between independent and dependent variables

In [ ]:
# Plotting correlation heatmap of the features

category_columns = ['Personal_Loan', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'Education'] # Changed column names to match actual names in the DataFrame
personal_loan_df[category_columns] = personal_loan_df[category_columns].astype('int')

# Selecting only numerical columns for correlation calculation
numerical_df = personal_loan_df.select_dtypes(include=np.number)


sns.set(rc={"figure.figsize": (15, 15)})
sns.heatmap(
    numerical_df.corr(), # Calculating correlation on numerical_df
    annot=True,
    linewidths=0.5,
    center=0,
    cbar=False,
    cmap="YlGnBu",
    fmt="0.2f",
)
plt.show()

personal_loan_df[category_columns] = personal_loan_df[category_columns].astype('category')

Observations

Correlation Between Age & Experience

  • Age and Experience show a high correlation, indicating redundancy in the data.
  • Since Experience had imputed values, it may not be as reliable as Age for modeling purposes.

Decision to Drop Experience

  • Dropping Experience is a logical choice to avoid multicollinearity in predictive models.
  • Retaining Age as the independent variable ensures data integrity and model accuracy.
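That decision translates to one guarded drop: verify the correlation, then remove the redundant column. A sketch, with a hypothetical stand-in DataFrame mimicking `personal_loan_df` (the 0.9 threshold is an illustrative choice, not a notebook value):

```python
import pandas as pd

def drop_if_collinear(df: pd.DataFrame, col: str, keep: str,
                      threshold: float = 0.9) -> pd.DataFrame:
    """Drop `col` when its correlation with `keep` exceeds the threshold."""
    r = df[col].corr(df[keep])
    if abs(r) > threshold:
        print(f'Dropping {col}: |r| with {keep} is {abs(r):.2f}')
        return df.drop(columns=[col])
    return df

# Hypothetical stand-in data with Age and Experience nearly collinear
demo = pd.DataFrame({'Age': [23, 30, 45, 55, 67],
                     'Experience': [0, 6, 20, 30, 43],
                     'Income': [8, 39, 64, 98, 224]})
demo = drop_if_collinear(demo, 'Experience', keep='Age')
print(demo.columns.tolist())
```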

Income vs. Education by Personal Loan

In [ ]:
sns.set(palette='Accent')
#Income Vs Education Vs Personal_Loan
plt.figure(figsize=(15,5))
sns.boxplot(data=personal_loan_df,y='Income',x='Education',hue='Personal_Loan') # Changed 'income', 'education', and 'personal_loan' to 'Income', 'Education', and 'Personal_Loan' respectively
plt.show()

Observations

  • Education & Income Correlation:

    • Higher education levels are associated with higher mean income, showing a direct relationship between academic attainment and financial earnings.
  • Loan Adoption by Education Level:

    • Customers with Graduate and Advanced education levels who took personal loans have a significantly higher mean income compared to those with an Undergraduate education.
    • This suggests that financially stable, educated individuals may be more comfortable with leveraging loans for strategic purposes, such as investments or major purchases.
  • Potential Action:

    • Tailor loan offerings to graduate and advanced-level customers with customized benefits, such as lower interest rates or flexible repayment plans.
    • Investigate whether occupation types or financial behaviors further refine loan adoption trends across education levels.

Income vs. Family Size by Personal Loan

In [ ]:
#Income Vs Family Vs Personal_Loan
plt.figure(figsize=(15,5))
sns.boxplot(data=personal_loan_df,y='Income',x='Family',hue='Personal_Loan') # Changed 'income', 'family', and 'personal_loan' to 'Income', 'Family', and 'Personal_Loan' respectively
plt.show()

Observations

  • Higher Income Across All Family Groups:

    • Customers who borrowed a Personal Loan exhibit significantly higher income levels across all family sizes compared to non-borrowers.
  • Possible Explanation:

    • Higher-income individuals may qualify for better loan terms, making them more likely to use personal loans.
    • They might also leverage loans strategically for investments, asset purchases, or liquidity management rather than financial necessity.
  • Potential Action:

    • Banks could develop premium loan products tailored for higher-income groups, offering exclusive benefits like lower interest rates or flexible repayment options.
    • Further segmentation could explore whether family size impacts loan purpose, refining loan marketing strategies based on household dynamics.

Mortgage Value vs. Family size by Personal Loan

In [ ]:
#Mortage Value Vs Family Vs Personal_Loan

plt.figure(figsize=(15,5))
sns.boxplot(data=personal_loan_df,y='Mortgage',x='Family',hue='Personal_Loan') # Changed 'mortgage', 'family', and 'personal_loan' to 'Mortgage', 'Family', and 'Personal_Loan' respectively
plt.show()

Observations

  • Outliers in Family Size (1-2) for Non-Borrowers:

    • Customers with smaller family sizes (1 or 2 members) who did not borrow a Personal Loan show several outliers compared to the general trend.
    • These individuals may have unique financial behaviors, such as exceptionally high or low income, irregular spending, or alternative financial strategies.
  • Mortgage vs. Family Size Relationship:

    • As family size increases, mortgage values tend to rise, likely due to higher housing costs for larger households.
    • Customers with larger families also borrow personal loans more frequently, indicating a stronger need for financial support.
  • Potential Action:

    • Investigate the nature of outliers within the small family size category—are they high earners avoiding loans, or individuals with unusual financial circumstances?
    • Explore how family size impacts loan repayment trends, refining risk assessment strategies based on household financial needs.

CC Average vs. Credit Card by Personal Loan

In [ ]:
#CCAvg Vs Credit Card Vs Personal_Loan

plt.figure(figsize=(15,5))
sns.boxplot(data=personal_loan_df,y='CCAvg',x='CreditCard',hue='Personal_Loan') # Changed 'ccavg', 'creditcard', and 'personal_loan' to 'CCAvg', 'CreditCard', and 'Personal_Loan' respectively
plt.show()

Observations

  • Credit Card Usage & Loan Adoption:

    • Customers who borrowed personal loans tend to have a higher average credit card expenditure, indicating a correlation between financial engagement and loan uptake.
  • Outliers in Non-Borrowers:

    • Several outliers exist among customers who did not take personal loans, suggesting divergent financial behaviors.
    • These outliers might include individuals with high credit usage but no loan needs, or low-credit users who simply avoid borrowing.
  • Potential Action:

    • Further outlier analysis (IQR, Z-score methods) could reveal whether income level or spending patterns explain these anomalies.
    • Investigate if high credit card users who haven’t taken loans are more inclined to alternative financial products.
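The Z-score method mentioned above is straightforward to sketch: standardize the series and keep values whose magnitude exceeds a cutoff (|z| > 3 is the usual convention). The spending values below are hypothetical:

```python
import pandas as pd

def zscore_outliers(s: pd.Series, cutoff: float = 3.0) -> pd.Series:
    """Return values whose Z-score magnitude exceeds the cutoff."""
    z = (s - s.mean()) / s.std()
    return s[z.abs() > cutoff]

# Hypothetical CC spending values (in $k) with one extreme point;
# repeated so a single outlier cannot inflate the std too much
spend = pd.Series([0.5, 0.7, 1.0, 1.5, 1.8, 2.0, 2.5] * 4 + [50.0])
print(zscore_outliers(spend).tolist())
```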

Additional Insights from the EDA

  • Customer Segmentation Based on Financial Behavior:

    • High-profile customers (higher income, mortgage, and credit card expenditure) tend to actively borrow personal loans, likely due to financial confidence, investment opportunities, or debt management strategies.
    • Low-profile customers (low income, mortgage, and credit card spending) rarely borrow personal loans, possibly due to limited financial needs or restrictive loan eligibility.
    • Mid-profile customers exhibit mixed borrowing tendencies, indicating case-by-case financial decisions, possibly influenced by external factors such as interest rates, savings, or short-term liquidity needs.
  • Significance of These Features:

    • These three financial indicators—income, mortgage, and credit card expenditure—are highly valuable for customer profiling and loan strategy optimization.
    • They can be used for risk assessment models, loan marketing segmentation, and predictive analytics to refine banking services.

These observations provide powerful insights into customer loan adoption patterns and key predictors of personal loan uptake. Here’s a refined breakdown:

Key Drivers of Personal Loan Adoption

Higher Income → Increased Loan Uptake

  • Individuals with higher earnings tend to take personal loans, possibly due to greater financial confidence and better loan eligibility.

Family Size Influence

  • Customers with 2–4 family members are likelier to borrow personal loans, potentially to support household expenses.
  • Larger families (3+) show even stronger loan adoption, aligning with higher financial needs.

Mortgage & Loan Correlation

  • High-mortgage customers tend to opt for personal loans, likely to manage property costs or consolidate debt.

Credit Card Usage & Loan Uptake

  • Individuals with higher credit card averages tend to borrow personal loans, suggesting a financially active segment.

Education & Financial Standing

Advanced/Professional Education → More Loan Borrowers

  • More Graduate and Advanced degree holders borrow loans compared to Undergraduates, likely due to higher earning potential and credit access.

Certificate of Deposit (CD) Accounts & Loan Trends

  • 50% of CD holders have borrowed personal loans.
  • If a customer doesn’t have a CD account, they’re significantly less likely to take a loan (only 642 out of 5000 borrowed).
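Rates like these can be checked with a row-normalised crosstab. A minimal sketch on hypothetical rows (the column names mirror the dataset, but the values here are made up):

```python
import pandas as pd

# Hypothetical miniature of the dataset: CD account and personal loan flags
df = pd.DataFrame({
    "CD_Account":    [1, 1, 0, 0, 0, 0, 1, 0],
    "Personal_Loan": [1, 0, 0, 0, 1, 0, 1, 0],
})

# Row-normalised crosstab: loan take-up rate within each CD_Account group
rates = pd.crosstab(df["CD_Account"], df["Personal_Loan"], normalize="index")
print(rates)
```

On the real data, `rates.loc[1, 1]` would give the share of CD holders who borrowed and `rates.loc[0, 1]` the share of non-holders who did.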

Customers from ZIP codes such as 94720, 95616, and 94305 (amongst others) opt for personal loans more frequently, suggesting regional financial behaviors or market dynamics.
Middle-aged customers (35–55 years) are the majority of loan borrowers, aligning with life stages where financial commitments peak.

Most Important Features for Predicting Loan Adoption

📌 Income
📌 Family Size
📌 Education Level
📌 CD Account Ownership
📌 Region

Outlier Treatment for Right-Skewed Data:

  • Income, Credit Card Average, and Mortgage exhibit right-skewed distributions with higher outliers, which need proper handling before modeling.
  • Consider applying log transformations or capping extreme values using IQR filtering to normalize the data.
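A minimal sketch of both options on a hypothetical right-skewed sample (the values are made up; on the real data the same idea would apply to Income, CCAvg, and Mortgage):

```python
import numpy as np
import pandas as pd

# Hypothetical right-skewed income sample (in thousand dollars), with one extreme value
income = pd.Series([8, 39, 64, 98, 150, 224, 500])

# Option 1: a log transformation compresses the right tail
income_log = np.log1p(income)

# Option 2: IQR filtering caps extreme values at the whiskers
q1, q3 = income.quantile(0.25), income.quantile(0.75)
iqr = q3 - q1
income_capped = income.clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

print(income_capped.max())  # the 500 outlier is pulled down to the upper whisker
```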

Dropping Unnecessary Columns:

  • Experience, Age Bin, CC Avg Bin, Income Bin, Zip Code should be removed as they either show redundancy or do not add predictive value to the model.
  • This ensures better feature selection, preventing unnecessary complexity.
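The redundancy claim for Experience can be sanity-checked with a simple correlation against Age. A sketch on hypothetical values (in the real data the two columns track each other almost one-for-one):

```python
import pandas as pd

# Hypothetical sample: Experience rises almost in lockstep with Age
df = pd.DataFrame({
    "Age":        [25, 35, 45, 55, 65],
    "Experience": [1, 11, 20, 30, 40],
})

# A correlation near 1 indicates the two columns carry the same information
corr = df["Age"].corr(df["Experience"])
print(round(corr, 3))
```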

Data Preprocessing

- Feature Removal for Optimization:

  • Dropping Experience, Age Bin, CC Avg Bin, Income Bin, Zip Code, County eliminates redundancy and non-essential variables, improving model efficiency and avoiding overfitting.

- Outlier Capping Using Whiskers:

  • Outliers are treated using IQR-based capping, ensuring values below the lower whisker are adjusted to the lower whisker value and those above the upper whisker are capped at the upper whisker value.
  • This preserves data integrity while minimizing distortions caused by extreme values.

Duplication of Dataset

In [ ]:
# Keep an untouched copy of the dataset so preprocessing can be re-run from the original
personal_loan_df_original = personal_loan_df.copy()

Statistical Summary

In [ ]:
personal_loan_df.describe(include='all').T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Age 5000.0 NaN NaN NaN 45.3384 11.463166 23.0 35.0 45.0 55.0 67.0
Experience 5000.0 NaN NaN NaN 20.1346 11.415189 0.0 10.0 20.0 30.0 43.0
Income 5000.0 NaN NaN NaN 73.7742 46.033729 8.0 39.0 64.0 98.0 224.0
ZIPCode 5000.0 NaN NaN NaN 93169.257 1759.455086 90005.0 91911.0 93437.0 94608.0 96651.0
Family 5000.0 NaN NaN NaN 2.3964 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 NaN NaN NaN 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 3.0 1.0 2096.0 NaN NaN NaN NaN NaN NaN NaN
Mortgage 5000.0 NaN NaN NaN 56.4988 101.713802 0.0 0.0 0.0 101.0 635.0
Personal_Loan 5000.0 2.0 0.0 4520.0 NaN NaN NaN NaN NaN NaN NaN
Securities_Account 5000.0 2.0 0.0 4478.0 NaN NaN NaN NaN NaN NaN NaN
CD_Account 5000.0 2.0 0.0 4698.0 NaN NaN NaN NaN NaN NaN NaN
Online 5000.0 2.0 1.0 2984.0 NaN NaN NaN NaN NaN NaN NaN
CreditCard 5000.0 2.0 0.0 3530.0 NaN NaN NaN NaN NaN NaN NaN
income_bin 5000 3 Mid 2442 NaN NaN NaN NaN NaN NaN NaN
cc_spending_bin 5000 3 Mid 2371 NaN NaN NaN NaN NaN NaN NaN
age_bin 5000 3 Middle Aged 2510 NaN NaN NaN NaN NaN NaN NaN
Age_Group 5000 5 50-59 1334 NaN NaN NaN NaN NaN NaN NaN

Dropping Unnecessary Columns

In [ ]:
# Correct the column names to match the DataFrame
personal_loan_df.drop(columns=['Experience', 'ZIPCode', 'income_bin', 'cc_spending_bin', 'age_bin'], inplace=True)

Updated Statistical Summary

In [ ]:
personal_loan_df.describe(include='all').T
Out[ ]:
count unique top freq mean std min 25% 50% 75% max
Age 5000.0 NaN NaN NaN 45.3384 11.463166 23.0 35.0 45.0 55.0 67.0
Income 5000.0 NaN NaN NaN 73.7742 46.033729 8.0 39.0 64.0 98.0 224.0
Family 5000.0 NaN NaN NaN 2.3964 1.147663 1.0 1.0 2.0 3.0 4.0
CCAvg 5000.0 NaN NaN NaN 1.937938 1.747659 0.0 0.7 1.5 2.5 10.0
Education 5000.0 3.0 1.0 2096.0 NaN NaN NaN NaN NaN NaN NaN
Mortgage 5000.0 NaN NaN NaN 56.4988 101.713802 0.0 0.0 0.0 101.0 635.0
Personal_Loan 5000.0 2.0 0.0 4520.0 NaN NaN NaN NaN NaN NaN NaN
Securities_Account 5000.0 2.0 0.0 4478.0 NaN NaN NaN NaN NaN NaN NaN
CD_Account 5000.0 2.0 0.0 4698.0 NaN NaN NaN NaN NaN NaN NaN
Online 5000.0 2.0 1.0 2984.0 NaN NaN NaN NaN NaN NaN NaN
CreditCard 5000.0 2.0 0.0 3530.0 NaN NaN NaN NaN NaN NaN NaN
Age_Group 5000 5 50-59 1334 NaN NaN NaN NaN NaN NaN NaN

Outlier Treatment

In [ ]:
numerical_col = personal_loan_df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,5))

for i, variable in enumerate(numerical_col):
    plt.subplot(1,5,i+1)
    plt.boxplot(personal_loan_df[variable],whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

Outlier Analysis in Key Financial Features:

  • Income, Credit Card Average (CC Avg), and Mortgage exhibit high-end outliers, meaning extreme values skew the distribution.

Treating Outliers

To manage extreme values effectively, we will develop two functions following this approach:

✔ Values below the Lower Whisker will be adjusted to match the Lower Whisker threshold.

✔ Values above the Upper Whisker will be capped at the Upper Whisker limit, ensuring a balanced distribution. This method preserves data integrity

In [ ]:
def treat_outliers(personal_loan_df, col):
    '''
    Treats outliers in a numerical variable using IQR-based capping.
    personal_loan_df: data frame
    col: str, name of the numerical column
    '''
    Q1=personal_loan_df[col].quantile(0.25) # 25th quantile
    Q3=personal_loan_df[col].quantile(0.75)  # 75th quantile
    IQR=Q3-Q1
    Lower_Whisker = Q1 - 1.5*IQR
    Upper_Whisker = Q3 + 1.5*IQR
    personal_loan_df[col] = np.clip(personal_loan_df[col], Lower_Whisker, Upper_Whisker)
    # all the values smaller than Lower_Whisker will be assigned value of Lower_whisker
    # and all the values above upper_whisker will be assigned value of upper_Whisker
    return personal_loan_df

def treat_outliers_all(personal_loan_df, col_list):
    '''
    Treats outliers in all listed numerical variables.
    personal_loan_df: data frame
    col_list: list of numerical column names
    '''
    for col in col_list:
        personal_loan_df = treat_outliers(personal_loan_df,col)
    return personal_loan_df
In [ ]:
numerical_col = personal_loan_df.select_dtypes(include=np.number).columns.tolist()
# getting list of numerical columns

numerical_col.remove('Age')
numerical_col.remove('Family')

# treating outliers (the function modifies personal_loan_df in place and returns it)
personal_loan_df = treat_outliers_all(personal_loan_df, numerical_col)

Verify Outlier Treatment

In [ ]:
numerical_col =personal_loan_df.select_dtypes(include=np.number).columns.tolist()
plt.figure(figsize=(20,5))

for i, variable in enumerate(numerical_col):
    plt.subplot(1,5,i+1)
    plt.boxplot(personal_loan_df[variable],whis=1.5)
    plt.tight_layout()
    plt.title(variable)

plt.show()

After capping, the boxplots confirm there are no more outliers in the dataset.

Creating our Training and Testing Data

We'll first split the dataset into independent (feature) and dependent (target) variable sets. The multi-level categorical columns, Education and the binned features, are then one-hot encoded; the remaining categorical columns are left as-is since they already hold binary values, 1 or 0. Finally, we split the data into training and testing sets (30% for testing).
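A minimal sketch of the encode-then-split step with `pd.get_dummies`, on hypothetical rows (the column names mirror the dataset, the values are made up). Encoding before splitting guarantees the train and test sets end up with identical dummy columns:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical frame with one multi-level categorical (Education) and binary flags
df = pd.DataFrame({
    "Income":        [49, 34, 11, 100, 45, 29],
    "Education":     [1, 1, 2, 3, 2, 3],
    "CD_Account":    [0, 0, 0, 1, 0, 0],
    "Personal_Loan": [0, 0, 0, 1, 0, 0],
})

X = df.drop(columns=["Personal_Loan"])
y = df["Personal_Loan"]

# One-hot encode the multi-level categorical before splitting; the binary
# flag columns are left untouched
X = pd.get_dummies(X, columns=["Education"], drop_first=True)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)
print(X_train.columns.tolist())
```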

In [ ]:
from sklearn.model_selection import train_test_split

# X and y are assumed to be defined, with X containing the one-hot encoded categorical columns

# Split the data into training and testing sets
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=6)

# Print shapes to verify the split
print("X_train shape:", X_train.shape)
print("X_test shape:", X_test.shape)
print("y_train shape:", y_train.shape)
print("y_test shape:", y_test.shape)
print('Percentage of classes in training set:\n',y_train.value_counts(normalize=True)*100)
print('Percentage of classes in test set:\n',y_test.value_counts(normalize=True)*100)
X_train shape: (3500, 11)
X_test shape: (1500, 11)
y_train shape: (3500,)
y_test shape: (1500,)
Percentage of classes in training set:
 0    90.342857
1     9.657143
Name: Personal_Loan, dtype: float64
Percentage of classes in test set:
 0    90.533333
1     9.466667
Name: Personal_Loan, dtype: float64

We have split the dataset into training and testing sets. In both sets, the target variable shows a roughly 90:10 class distribution: about 90% of customers did not take a personal loan, while about 10% did. This class imbalance will be considered when adjusting the class_weight parameter during model training to ensure better handling of the minority class.


Model Building

Model Evaluation Criterion

Minimizing False Negatives in Loan Predictions

A model can make incorrect predictions in two ways:

✔ False Positive: Predicting a person will take a loan, but they actually don’t → Loss of Resources

✔ False Negative: Predicting a person won’t take a loan, but they actually do → Loss of Opportunity

Since the primary goal of the campaign is to bring in more customers, reducing False Negatives is the priority. If a potential customer is missed by the sales/marketing team, it represents a lost opportunity for conversion.

Optimizing for Recall

To minimize missed opportunities, the model should maximize Recall on the positive class, so that it correctly identifies as many actual loan adopters as possible. Higher Recall improves the chances of detecting potential loan adopters, even at the cost of slightly more False Positives.
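One way to trade precision for recall is to lower the 0.5 decision threshold applied to predict_proba (the scoring function defined later exposes exactly such a threshold parameter). A sketch on synthetic imbalanced data, where make_classification stands in for the loan dataset:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score

# Synthetic imbalanced data (~9% positives) standing in for the loan dataset
X, y = make_classification(n_samples=2000, weights=[0.91], random_state=1)
clf = LogisticRegression(max_iter=1000).fit(X, y)

proba = clf.predict_proba(X)[:, 1]

# Lowering the decision threshold flags more customers as likely adopters,
# which can only keep or raise recall (at some cost in precision)
recall_default = recall_score(y, proba > 0.5)
recall_low     = recall_score(y, proba > 0.3)
print(recall_default, recall_low)
```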


Scoring and Confusion Matrix

In [ ]:
def get_metrics_score(model,train,test,train_y,test_y,threshold=0.5,flag=True,roc=False):
    '''
    Function to calculate different metric scores of the model - Accuracy, Recall, Precision, and F1 score
    model: classifier to predict values of X
    train, test: independent features
    train_y, test_y: dependent variable
    threshold: threshold for classifying an observation as 1
    flag: if True, the print statements showing the different metrics are displayed. The default value is True.
    roc: if True, the ROC-AUC scores are also displayed. The default value is False.
    '''
    # defining an empty list to store train and test results

    score_list=[]

    pred_train = (model.predict_proba(train)[:,1]>threshold)
    pred_test = (model.predict_proba(test)[:,1]>threshold)

    pred_train = np.round(pred_train)
    pred_test = np.round(pred_test)

    train_acc = accuracy_score(pred_train,train_y)
    test_acc = accuracy_score(pred_test,test_y)

    train_recall = recall_score(train_y,pred_train)
    test_recall = recall_score(test_y,pred_test)

    train_precision = precision_score(train_y,pred_train)
    test_precision = precision_score(test_y,pred_test)

    train_f1 = f1_score(train_y,pred_train)
    test_f1 = f1_score(test_y,pred_test)


    score_list.extend((train_acc,test_acc,train_recall,test_recall,train_precision,test_precision,train_f1,test_f1,pred_train,pred_test))


    if flag == True:
        print("Accuracy on training set : ",accuracy_score(pred_train,train_y))
        print("Accuracy on test set : ",accuracy_score(pred_test,test_y))
        print("Recall on training set : ",recall_score(train_y,pred_train))
        print("Recall on test set : ",recall_score(test_y,pred_test))
        print("Precision on training set : ",precision_score(train_y,pred_train))
        print("Precision on test set : ",precision_score(test_y,pred_test))
        print("F1 on training set : ",f1_score(train_y,pred_train))
        print("F1 on test set : ",f1_score(test_y,pred_test))

    if roc == True:
        # ROC-AUC must be computed on the predicted probabilities, not the hard labels
        pred_train_prob = model.predict_proba(train)[:,1]
        pred_test_prob = model.predict_proba(test)[:,1]
        print("ROC-AUC Score on training set : ",roc_auc_score(train_y,pred_train_prob))
        print("ROC-AUC Score on test set : ",roc_auc_score(test_y,pred_test_prob))

    return score_list # returning the list with train and test scores
In [ ]:
def make_confusion_matrix(model, test_X, y_actual, i, seg, labels=[1, 0]):
    '''
    model: classifier to predict values of X
    test_X: test set
    y_actual: ground truth
    i: index of the global subplot axes array to draw on
    seg: segment label used in the plot title, e.g. 'Training' or 'Testing'
    labels: label order for the confusion matrix; [1, 0] keeps 'Borrowed' first
    '''
    y_predict = model.predict(test_X)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=['Actual - Borrowed', 'Actual - Not Borrowed'],
                         columns=['Predicted - Borrowed', 'Predicted - Not Borrowed'])
    group_counts = ["{0:0.0f}".format(value) for value in cm.flatten()]
    group_percentages = ["{0:.2%}".format(value) for value in cm.flatten() / np.sum(cm)]
    annot = np.asarray([f"{v1}\n{v2}" for v1, v2 in zip(group_counts, group_percentages)]).reshape(2, 2)
    # axes is expected to be a global array of subplot axes created with plt.subplots beforehand
    sns.heatmap(df_cm, annot=annot, fmt='', ax=axes[i], cmap='Blues').set(
        title='Confusion Matrix of {} Set'.format(seg))
In [ ]:
# defining empty lists to add train and test results
acc_train = []
acc_test = []
recall_train = []
recall_test = []
precision_train = []
precision_test = []
f1_train = []
f1_test = []

def add_score_model(score):
    '''Add scores to list so that we can compare all models score together'''
    acc_train.append(score[0])
    acc_test.append(score[1])
    recall_train.append(score[2])
    recall_test.append(score[3])
    precision_train.append(score[4])
    precision_test.append(score[5])
    f1_train.append(score[6])
    f1_test.append(score[7])

Logistic Regression

In [ ]:
# Create 'income_bin' and 'cc_spending_bin' in personal_loan_df
personal_loan_df["income_bin"] = pd.cut(x=personal_loan_df["Income"], bins=[0, 39, 98, 224], labels=["Low", "Mid", "High"])
personal_loan_df["cc_spending_bin"] = pd.cut(x=personal_loan_df["CCAvg"], bins=[-0.0001, 0.7, 2.5, 10.0], labels=["Low", "Mid", "High"])
personal_loan_df["age_bin"] = pd.cut(x=personal_loan_df["Age"], bins=[0, 35, 55, 67], labels=["Young Adults", "Middle Aged", "Senior"])

# Define the feature matrix X and target y
# 'Experience' was dropped earlier, so guard for its presence
if 'Experience' in personal_loan_df.columns:
    X = personal_loan_df[['Age', 'Experience', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage',
                           'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'income_bin', 'cc_spending_bin', 'age_bin']]  # Include the new columns
else:
    # Handle the case where 'Experience' is missing, e.g., print a warning or use a different set of features
    print("Warning: 'Experience' column not found in personal_loan_df. Using remaining features.")
    X = personal_loan_df[['Age', 'Income', 'Family', 'CCAvg', 'Education', 'Mortgage',
                           'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'income_bin', 'cc_spending_bin', 'age_bin']]

y = personal_loan_df['Personal_Loan']

# Now perform the train-test split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

# Convert categorical features to numerical using one-hot encoding
X_train = pd.get_dummies(X_train,
                        columns=['Education', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'income_bin', 'cc_spending_bin', 'age_bin'],
                        drop_first=True)
X_test = pd.get_dummies(X_test,
                        columns=['Education', 'Securities_Account', 'CD_Account', 'Online', 'CreditCard', 'income_bin', 'cc_spending_bin', 'age_bin'],
                        drop_first=True)
Warning: 'Experience' column not found in personal_loan_df. Using remaining features.

Model Score

In [ ]:
# Import the necessary libraries
from sklearn.linear_model import LogisticRegression

# Create and train the Logistic Regression model
model1 = LogisticRegression(random_state=1)
model1.fit(X_train, y_train)

# Evaluate the model and store its scores
scores_LR = get_metrics_score(model1, X_train, X_test, y_train, y_test)
add_score_model(scores_LR)
Accuracy on training set :  0.9682857142857143
Accuracy on test set :  0.9566666666666667
Recall on training set :  0.7708333333333334
Recall on test set :  0.6875
Precision on training set :  0.8839590443686007
Precision on test set :  0.8319327731092437
F1 on training set :  0.8235294117647058
F1 on test set :  0.752851711026616
/root/.local/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn import metrics

def make_confusion_matrix(model, test_X, y_actual, i, seg, labels=[1, 0]):
    '''
    This function plots a confusion matrix for a given model and data.

    Args:
        model: The trained machine learning model.
        test_X: The feature data for testing.
        y_actual: The actual target values for the test data.
        i: Index for subplot (if creating multiple plots).
        seg: Segment label (e.g., 'Training', 'Testing').
        labels: Label order for the matrix; [1, 0] keeps 'Borrowed' first to match the axis names.

    Returns:
        None (displays the confusion matrix plot).
    '''
    y_predict = model.predict(test_X)
    cm = metrics.confusion_matrix(y_actual, y_predict, labels=labels)
    df_cm = pd.DataFrame(cm, index=['Actual - Borrowed', 'Actual - Not Borrowed'],
                         columns=['Predicted - Borrowed', 'Predicted - Not Borrowed'])

    if seg == 'Training':
        # Adjust the position of the color bar for the training set
        plt.subplot(1, 2, 1)
        sns.heatmap(df_cm, annot=True, fmt=".0f", annot_kws={"size": 12}, cmap='Blues')
        plt.title(seg + ' Confusion Matrix', color='black')
        plt.tight_layout()
    else:
        # Adjust the position of the color bar for the testing set
        plt.subplot(1, 2, 2)
        sns.heatmap(df_cm, annot=True, fmt=".0f", annot_kws={"size": 12}, cmap='Blues')
        plt.title(seg + ' Confusion Matrix', color='black')
        plt.tight_layout()

# X_train, X_test, y_train, y_test are the one-hot encoded splits created earlier

# Create and train the Logistic Regression model
model1 = LogisticRegression(random_state=1)
model1.fit(X_train, y_train)

# Now you can evaluate the model and create the confusion matrices
scores_LR = get_metrics_score(model1, X_train, X_test, y_train, y_test)
add_score_model(scores_LR)

# Create the confusion matrices
fig, axes = plt.subplots(1, 2, figsize=(16, 5))
# Using subplots to create a side-by-side view

make_confusion_matrix(model1, X_train, y_train, i=0, seg='Training')
make_confusion_matrix(model1, X_test, y_test, i=1, seg='Testing')

plt.show()  # Display the confusion matrices
/root/.local/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
Accuracy on training set :  0.9682857142857143
Accuracy on test set :  0.9566666666666667
Recall on training set :  0.7708333333333334
Recall on test set :  0.6875
Precision on training set :  0.8839590443686007
Precision on test set :  0.8319327731092437
F1 on training set :  0.8235294117647058
F1 on test set :  0.752851711026616

ROC

In [ ]:
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import roc_auc_score, roc_curve  # Import necessary functions

fig, axes = plt.subplots(1, 2, figsize=(20, 7))

# Initialize and fit the Logistic Regression model used for the ROC curves
lg1 = LogisticRegression(random_state=1)
lg1.fit(X_train, y_train)

# ROC Curve for Training Data
logit_roc_auc_train = roc_auc_score(y_train, lg1.predict_proba(X_train)[:, 1])
fpr, tpr, thresholds = roc_curve(y_train, lg1.predict_proba(X_train)[:, 1])
sns.lineplot(x=fpr, y=tpr, ax=axes[0]).set(
    title="Receiver operating characteristic on Train\nLogistic Regression (area = %0.2f)"
    % logit_roc_auc_train
)
axes[0].plot([0, 1], [0, 1], "r--")
axes[0].set_xlim([0.0, 1.0])
axes[0].set_ylim([0.0, 1.05])
axes[0].set_xlabel("False Positive Rate")
axes[0].set_ylabel("True Positive Rate")

# ROC Curve for Test Data
logit_roc_auc_test = roc_auc_score(y_test, lg1.predict_proba(X_test)[:, 1])
fpr, tpr, thresholds = roc_curve(y_test, lg1.predict_proba(X_test)[:, 1])
sns.lineplot(x=fpr, y=tpr, ax=axes[1]).set(
    title="Receiver operating characteristic on Test\nLogistic Regression (area = %0.2f)"
    % logit_roc_auc_test
)
axes[1].plot([0, 1], [0, 1], "r--")
axes[1].set_xlim([0.0, 1.0])
axes[1].set_ylim([0.0, 1.05])
axes[1].set_xlabel("False Positive Rate")
axes[1].set_ylabel("True Positive Rate")

plt.show()
/root/.local/lib/python3.11/site-packages/sklearn/linear_model/_logistic.py:458: ConvergenceWarning: lbfgs failed to converge (status=1):
STOP: TOTAL NO. OF ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

The Logistic Regression model performs well on both the training and test sets, but its recall is poor. Since the dataset has a roughly 90:10 class imbalance, the model is biased toward the majority class (non-loan customers), making it difficult to correctly identify potential loan adopters. To address this imbalance, we will adjust the class_weight parameter to ensure better identification of minority-class customers (loan adopters) and improve recall.
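A minimal sketch of the class_weight adjustment on synthetic 91:9 data (make_classification stands in for the loan dataset; this is an illustration of the parameter, not the tuned model):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic 91:9 data standing in for the loan dataset
X, y = make_classification(n_samples=2000, weights=[0.91], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

# class_weight='balanced' reweights the loss so the minority class counts as much
# as the majority class, which typically raises recall on the positives
plain = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
balanced = LogisticRegression(max_iter=1000, class_weight="balanced").fit(X_tr, y_tr)

r_plain = recall_score(y_te, plain.predict(X_te))
r_bal = recall_score(y_te, balanced.predict(X_te))
print(r_plain, r_bal)
```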

In [ ]:
# Encode categorical variables using one-hot encoding.
X_train_encoded = pd.get_dummies(X_train)
X_test_encoded = pd.get_dummies(X_test)

# Ensure the training and test sets have the same columns after encoding.
X_train_encoded, X_test_encoded = X_train_encoded.align(X_test_encoded, join='left', axis=1, fill_value=0)

# Initialize the DecisionTreeClassifier with the Gini impurity criterion and a fixed random state.
model = DecisionTreeClassifier(
    criterion="gini",
    random_state=1,
    # max_depth=5,            # Limit the maximum depth of the tree
    # min_samples_split=10,   # Minimum number of samples required to split an internal node
    # min_samples_leaf=5,     # Minimum number of samples required to be at a leaf node
    # max_leaf_nodes=20       # Maximum number of leaf nodes
)

# Fit the model on the encoded training data.
model.fit(X_train_encoded, y_train)
Out[ ]:
DecisionTreeClassifier(random_state=1)
In [ ]:
# Predict on the training data.
y_train_pred = model.predict(X_train_encoded)

# Calculate accuracy.
accuracy = accuracy_score(y_train, y_train_pred)
print(f'Accuracy: {accuracy:.2f}')

# Calculate precision.
precision = precision_score(y_train, y_train_pred)
print(f'Precision: {precision:.2f}')

# Calculate recall.
recall = recall_score(y_train, y_train_pred)
print(f'Recall: {recall:.2f}')

# Calculate F1 score.
f1 = f1_score(y_train, y_train_pred)
print(f'F1 Score: {f1:.2f}')

# Print classification report.
print('Classification Report:')
print(classification_report(y_train, y_train_pred))

# Calculate confusion matrix.
conf_matrix = confusion_matrix(y_train, y_train_pred)

# Plot confusion matrix.
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Accepted', 'Accepted'], yticklabels=['Not Accepted', 'Accepted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3164
           1       1.00      1.00      1.00       336

    accuracy                           1.00      3500
   macro avg       1.00      1.00      1.00      3500
weighted avg       1.00      1.00      1.00      3500
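The perfect 1.00 training scores indicate the unconstrained tree has memorised the training set. A sketch of how the commented-out pruning parameters rein this in, on synthetic data (make_classification stands in for the loan dataset):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data standing in for the loan dataset
X, y = make_classification(n_samples=2000, weights=[0.9], random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1, stratify=y)

# An unconstrained tree fits the training data perfectly; capping depth and
# leaf size trades training fit for generalisation
full = DecisionTreeClassifier(random_state=1).fit(X_tr, y_tr)
pruned = DecisionTreeClassifier(random_state=1, max_depth=5, min_samples_leaf=5).fit(X_tr, y_tr)

print(full.score(X_tr, y_tr), pruned.score(X_tr, y_tr))
```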

In [ ]:
# Function for model performance evaluation.

def model_performance_classification_sklearn(model, X, y):
    # Predict the labels for the input features X using the provided model.
    y_pred = model.predict(X)

    # Calculate the accuracy of the model.
    accuracy = accuracy_score(y, y_pred)

    # Calculate the precision of the model.
    precision = precision_score(y, y_pred)

    # Calculate the recall of the model.
    recall = recall_score(y, y_pred)

    # Calculate the F1 score of the model.
    f1 = f1_score(y, y_pred)

    # Return a dictionary containing the performance metrics.
    return {'Accuracy': accuracy, 'Precision': precision, 'Recall': recall, 'F1 Score': f1}
In [ ]:
# Check performance on training data.
decision_tree_perf_train = model_performance_classification_sklearn(
    model, X_train_encoded, y_train
)
decision_tree_perf_train
Out[ ]:
{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}
In [ ]:
# Visualize the decision tree.
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    model, # type: ignore
    feature_names=list(X_train_encoded.columns),  # type: ignore
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# Add arrows to the decision tree split if they are missing.
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
In [ ]:
# Generate a text report showing the rules of the decision tree.
tree_rules = tree.export_text(model, feature_names=list(X_train_encoded.columns), show_weights=True) # type: ignore
print(tree_rules)
|--- Income <= 104.50
|   |--- CCAvg <= 2.95
|   |   |--- weights: [2519.00, 0.00] class: 0
|   |--- CCAvg >  2.95
|   |   |--- Income <= 92.50
|   |   |   |--- CD_Account_1 <= 0.50
|   |   |   |   |--- Age <= 26.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  26.50
|   |   |   |   |   |--- Income <= 81.50
|   |   |   |   |   |   |--- Age <= 36.50
|   |   |   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [12.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Age >  36.50
|   |   |   |   |   |   |   |--- weights: [61.00, 0.00] class: 0
|   |   |   |   |   |--- Income >  81.50
|   |   |   |   |   |   |--- Online_1 <= 0.50
|   |   |   |   |   |   |   |--- Age <= 30.00
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Age >  30.00
|   |   |   |   |   |   |   |   |--- Age <= 45.00
|   |   |   |   |   |   |   |   |   |--- weights: [6.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  45.00
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.05
|   |   |   |   |   |   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.05
|   |   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.70
|   |   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 4.00] class: 1
|   |   |   |   |   |   |   |   |   |   |--- CCAvg >  3.70
|   |   |   |   |   |   |   |   |   |   |   |--- truncated branch of depth 2
|   |   |   |   |   |   |--- Online_1 >  0.50
|   |   |   |   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |   |   |   |--- Age <= 48.50
|   |   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Age >  48.50
|   |   |   |   |   |   |   |   |   |--- CCAvg <= 3.55
|   |   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |   |--- CCAvg >  3.55
|   |   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- Income >  82.50
|   |   |   |   |   |   |   |   |--- weights: [25.00, 0.00] class: 0
|   |   |   |--- CD_Account_1 >  0.50
|   |   |   |   |--- Income <= 82.50
|   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |--- Income >  82.50
|   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |--- Income >  92.50
|   |   |   |--- CCAvg <= 4.45
|   |   |   |   |--- Education_3 <= 0.50
|   |   |   |   |   |--- Education_2 <= 0.50
|   |   |   |   |   |   |--- Age <= 61.50
|   |   |   |   |   |   |   |--- CCAvg <= 4.35
|   |   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  4.35
|   |   |   |   |   |   |   |   |--- Income <= 100.00
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |   |--- Income >  100.00
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Age >  61.50
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |--- Education_2 >  0.50
|   |   |   |   |   |   |--- Age <= 61.00
|   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |--- Age >  61.00
|   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |--- Education_3 >  0.50
|   |   |   |   |   |--- Family <= 2.50
|   |   |   |   |   |   |--- Mortgage <= 74.50
|   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- Mortgage >  74.50
|   |   |   |   |   |   |   |--- age_bin_Middle Aged <= 0.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |   |--- age_bin_Middle Aged >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  2.50
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |--- CCAvg >  4.45
|   |   |   |   |--- Age <= 57.50
|   |   |   |   |   |--- Securities_Account_1 <= 0.50
|   |   |   |   |   |   |--- weights: [13.00, 0.00] class: 0
|   |   |   |   |   |--- Securities_Account_1 >  0.50
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- Age >  57.50
|   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|--- Income >  104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- Income <= 116.50
|   |   |   |   |   |--- CCAvg <= 2.85
|   |   |   |   |   |   |--- Age <= 28.50
|   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |--- Age >  28.50
|   |   |   |   |   |   |   |--- weights: [8.00, 0.00] class: 0
|   |   |   |   |   |--- CCAvg >  2.85
|   |   |   |   |   |   |--- weights: [0.00, 6.00] class: 1
|   |   |   |   |--- Income >  116.50
|   |   |   |   |   |--- weights: [0.00, 54.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- Income <= 116.50
|   |   |   |   |--- CCAvg <= 1.10
|   |   |   |   |   |--- weights: [4.00, 0.00] class: 0
|   |   |   |   |--- CCAvg >  1.10
|   |   |   |   |   |--- Age <= 33.00
|   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |--- Age >  33.00
|   |   |   |   |   |   |--- CCAvg <= 3.27
|   |   |   |   |   |   |   |--- Age <= 50.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 5.00] class: 1
|   |   |   |   |   |   |   |--- Age >  50.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  3.27
|   |   |   |   |   |   |   |--- CCAvg <= 3.95
|   |   |   |   |   |   |   |   |--- weights: [2.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- CCAvg >  3.95
|   |   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |--- Income >  116.50
|   |   |   |   |--- weights: [0.00, 67.00] class: 1
|   |--- Family >  2.50
|   |   |--- Income <= 114.50
|   |   |   |--- cc_spending_bin_Mid <= 0.50
|   |   |   |   |--- Age <= 60.00
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- Age <= 36.00
|   |   |   |   |   |   |   |--- weights: [0.00, 2.00] class: 1
|   |   |   |   |   |   |--- Age >  36.00
|   |   |   |   |   |   |   |--- age_bin_Middle Aged <= 0.50
|   |   |   |   |   |   |   |   |--- Income <= 112.50
|   |   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |   |   |   |--- Income >  112.50
|   |   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |   |   |--- age_bin_Middle Aged >  0.50
|   |   |   |   |   |   |   |   |--- weights: [1.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- weights: [0.00, 7.00] class: 1
|   |   |   |   |--- Age >  60.00
|   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |--- cc_spending_bin_Mid >  0.50
|   |   |   |   |--- age_bin_Middle Aged <= 0.50
|   |   |   |   |   |--- CCAvg <= 1.00
|   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |   |--- CCAvg >  1.00
|   |   |   |   |   |   |--- CCAvg <= 2.45
|   |   |   |   |   |   |   |--- weights: [17.00, 0.00] class: 0
|   |   |   |   |   |   |--- CCAvg >  2.45
|   |   |   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |   |   |--- weights: [0.00, 1.00] class: 1
|   |   |   |   |--- age_bin_Middle Aged >  0.50
|   |   |   |   |   |--- Family <= 3.50
|   |   |   |   |   |   |--- weights: [3.00, 0.00] class: 0
|   |   |   |   |   |--- Family >  3.50
|   |   |   |   |   |   |--- weights: [0.00, 3.00] class: 1
|   |   |--- Income >  114.50
|   |   |   |--- weights: [0.00, 155.00] class: 1

In [ ]:
# Assign the feature importances to a variable.
importances = model.feature_importances_

# Sort indices in descending order.
indices = np.argsort(importances)[::-1]

# Create a DataFrame for feature importances.
feature_importances_df = pd.DataFrame({
    'Feature': [list(X_train_encoded.columns)[i] for i in indices],  # type: ignore
    'Importance': importances[indices]
})

# Print the DataFrame.
print(feature_importances_df)
                 Feature  Importance
0                 Income    0.363510
1                 Family    0.210781
2            Education_2    0.163078
3            Education_3    0.145189
4                  CCAvg    0.053682
5                    Age    0.038127
6    cc_spending_bin_Mid    0.007661
7           CD_Account_1    0.005728
8    age_bin_Middle Aged    0.005341
9   Securities_Account_1    0.003057
10              Online_1    0.002747
11              Mortgage    0.001097
12          CreditCard_1    0.000000
13        income_bin_Mid    0.000000
14       income_bin_High    0.000000
15  cc_spending_bin_High    0.000000
16        age_bin_Senior    0.000000
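For reporting, the importance table above is often easier to digest as a bar chart. Below is a minimal sketch using the nonzero values copied from the printed table; inside the notebook the series could instead be built from `feature_importances_df.set_index('Feature')['Importance']`.

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend; drop this line inside a notebook
import pandas as pd

# Nonzero importances copied from the table above.
importances = pd.Series({
    "Income": 0.363510, "Family": 0.210781, "Education_2": 0.163078,
    "Education_3": 0.145189, "CCAvg": 0.053682, "Age": 0.038127,
})

# Horizontal bars, largest at the top.
ax = importances.sort_values().plot(kind="barh", figsize=(8, 4),
                                    title="Feature importances (unpruned tree)")
ax.set_xlabel("Importance")
```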
In [ ]:
# Check model performance on the test data.

# Predict on the test data.
y_test_pred = model.predict(X_test_encoded) # type: ignore

# Calculate accuracy.
accuracy = accuracy_score(y_test, y_test_pred)
print(f'Accuracy: {accuracy:.2f}')

# Calculate precision.
precision = precision_score(y_test, y_test_pred)
print(f'Precision: {precision:.2f}')

# Calculate recall.
recall = recall_score(y_test, y_test_pred)
print(f'Recall: {recall:.2f}')

# Calculate F1 score.
f1 = f1_score(y_test, y_test_pred)
print(f'F1 Score: {f1:.2f}')

# Print classification report.
print('Classification Report:')
print(classification_report(y_test, y_test_pred))

# Calculate confusion matrix
conf_matrix = confusion_matrix(y_test, y_test_pred)

# Plot confusion matrix
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Accepted', 'Accepted'], yticklabels=['Not Accepted', 'Accepted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Accuracy: 0.98
Precision: 0.96
Recall: 0.85
F1 Score: 0.90
Classification Report:
              precision    recall  f1-score   support

           0       0.98      1.00      0.99      1356
           1       0.96      0.85      0.90       144

    accuracy                           0.98      1500
   macro avg       0.97      0.93      0.95      1500
weighted avg       0.98      0.98      0.98      1500

In [ ]:
# Check performance on the test data.
decision_tree_perf_test = model_performance_classification_sklearn(model, X_test_encoded, y_test) # type: ignore
decision_tree_perf_test
Out[ ]:
{'Accuracy': 0.9826666666666667,
 'Precision': 0.9609375,
 'Recall': 0.8541666666666666,
 'F1 Score': 0.9044117647058825}

Model Performance Improvement

Pre-Pruning

In [ ]:
# Choose the type of classifier.
estimator = DecisionTreeClassifier(random_state=1)

# Hyperparameter grid.
parameters = {
    "max_depth": np.arange(6, 15),
    "min_samples_leaf": [1, 2, 5, 7, 10],
    "max_leaf_nodes": [2, 3, 5, 10],
}

# Use recall as the scoring metric to compare parameter combinations.
recall_scorer = make_scorer(recall_score)

# Run the grid search.
grid_obj = GridSearchCV(estimator, parameters, scoring=recall_scorer, cv=5)
grid_obj = grid_obj.fit(X_train_encoded, y_train) # type: ignore

# Set the estimator to the best found combination of parameters.
estimator = grid_obj.best_estimator_

# Fit the best algorithm to the data.
estimator.fit(X_train_encoded, y_train) # type: ignore
Out[ ]:
DecisionTreeClassifier(max_depth=6, max_leaf_nodes=5, random_state=1)
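`GridSearchCV` also records the winning hyperparameters and their cross-validated score, which are worth printing before trusting `best_estimator_`. A self-contained sketch on synthetic data (the toy dataset and small grid below are illustrative, not the notebook's actual data):

```python
from sklearn.datasets import make_classification
from sklearn.metrics import make_scorer, recall_score
from sklearn.model_selection import GridSearchCV
from sklearn.tree import DecisionTreeClassifier

# Imbalanced toy data standing in for X_train_encoded / y_train.
X, y = make_classification(n_samples=300, n_features=5, n_informative=3,
                           weights=[0.9, 0.1], random_state=1)

grid = GridSearchCV(
    DecisionTreeClassifier(random_state=1),
    {"max_depth": [3, 5, 7], "min_samples_leaf": [1, 5]},
    scoring=make_scorer(recall_score),
    cv=5,
).fit(X, y)

# best_params_ is the winning combination; best_score_ its mean CV recall.
print(grid.best_params_)
print(round(grid.best_score_, 3))
```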
In [ ]:
# Check the unpruned model's performance on the training data.

# Predict on the training data.
y_train_pred = model.predict(X_train_encoded) # type: ignore

# Calculate accuracy.
accuracy = accuracy_score(y_train, y_train_pred) # type: ignore
print(f'Accuracy: {accuracy:.2f}')

# Calculate precision.
precision = precision_score(y_train, y_train_pred) # type: ignore
print(f'Precision: {precision:.2f}')

# Calculate recall.
recall = recall_score(y_train, y_train_pred) # type: ignore
print(f'Recall: {recall:.2f}')

# Calculate F1 score.
f1 = f1_score(y_train, y_train_pred) # type: ignore
print(f'F1 Score: {f1:.2f}')

# Print classification report.
print('Classification Report:')
print(classification_report(y_train, y_train_pred)) # type: ignore

# Calculate confusion matrix.
conf_matrix = confusion_matrix(y_train, y_train_pred) # type: ignore

# Plot confusion matrix.
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Accepted', 'Accepted'], yticklabels=['Not Accepted', 'Accepted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Accuracy: 1.00
Precision: 1.00
Recall: 1.00
F1 Score: 1.00
Classification Report:
              precision    recall  f1-score   support

           0       1.00      1.00      1.00      3164
           1       1.00      1.00      1.00       336

    accuracy                           1.00      3500
   macro avg       1.00      1.00      1.00      3500
weighted avg       1.00      1.00      1.00      3500

In [ ]:
# Check the tuned model's performance on the training data.
decision_tree_tune_perf_train = model_performance_classification_sklearn(estimator, X_train_encoded, y_train) # type: ignore
decision_tree_tune_perf_train
Out[ ]:
{'Accuracy': 0.9782857142857143,
 'Precision': 0.8714285714285714,
 'Recall': 0.9077380952380952,
 'F1 Score': 0.8892128279883381}
In [ ]:
# Visualize the decision tree.
plt.figure(figsize=(20, 30))
out = tree.plot_tree(
    estimator,
    feature_names=list(X_train_encoded.columns), # type: ignore
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# Add arrows to the decision tree splits if they are missing.
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor("black")
        arrow.set_linewidth(1)
plt.show()
In [ ]:
# Predict on the training data.
y_train_pred = estimator.predict(X_train_encoded) # type: ignore

# Calculate accuracy.
accuracy = accuracy_score(y_train, y_train_pred)
print(f'Accuracy: {accuracy:.2f}')

# Calculate precision.
precision = precision_score(y_train, y_train_pred)
print(f'Precision: {precision:.2f}')

# Calculate recall.
recall = recall_score(y_train, y_train_pred)
print(f'Recall: {recall:.2f}')

# Calculate F1 score.
f1 = f1_score(y_train, y_train_pred)
print(f'F1 Score: {f1:.2f}')

# Print classification report.
print('Classification Report:')
print(classification_report(y_train, y_train_pred))

# Calculate confusion matrix.
conf_matrix = confusion_matrix(y_train, y_train_pred)

# Plot confusion matrix.
plt.figure(figsize=(8, 6))
sns.heatmap(conf_matrix, annot=True, fmt='d', cmap='Blues', xticklabels=['Not Accepted', 'Accepted'], yticklabels=['Not Accepted', 'Accepted'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()
Accuracy: 0.98
Precision: 0.87
Recall: 0.91
F1 Score: 0.89
Classification Report:
              precision    recall  f1-score   support

           0       0.99      0.99      0.99      3164
           1       0.87      0.91      0.89       336

    accuracy                           0.98      3500
   macro avg       0.93      0.95      0.94      3500
weighted avg       0.98      0.98      0.98      3500

In [ ]:
# Generate a text report showing the rules of the decision tree.
tree_rules = tree.export_text(estimator, feature_names=list(X_train_encoded.columns), show_weights=True) # type: ignore
print(tree_rules)
|--- Income <= 104.50
|   |--- weights: [2661.00, 31.00] class: 0
|--- Income >  104.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [458.00, 0.00] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [8.00, 61.00] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [9.00, 74.00] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [28.00, 170.00] class: 1

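The pruned tree above collapses to one easily communicated targeting rule: income above ~$104.5K combined with either a family of 3+ or a graduate/advanced education. A sketch translating it into a reusable filter; the `customers` DataFrame below is illustrative and uses the original un-encoded `Education` codes from the data dictionary:

```python
import pandas as pd

def likely_loan_adopter(df: pd.DataFrame) -> pd.Series:
    """Boolean mask mirroring the pre-pruned tree's rules:
    Income > 104.5 and (Family > 2.5 or Education in {2, 3})."""
    return (df["Income"] > 104.5) & ((df["Family"] > 2.5) | (df["Education"] >= 2))

# Illustrative customers (Income in $K; Education 1/2/3 as in the data dictionary).
customers = pd.DataFrame({
    "Income":    [150, 90, 120, 130],
    "Family":    [1, 4, 3, 2],
    "Education": [2, 3, 1, 1],
})
print(likely_loan_adopter(customers).tolist())  # [True, False, True, False]
```

A rule this simple can be handed to the marketing team directly, without deploying the model itself.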
In [ ]:
# Check the tuned model's performance on the test data.
decision_tree_tune_perf_test = model_performance_classification_sklearn(estimator, X_test_encoded, y_test) # type: ignore
decision_tree_tune_perf_test
In [ ]:
# Compute the pruning path for the decision tree using minimal cost-complexity pruning.
clf = DecisionTreeClassifier(random_state=1)
path = clf.cost_complexity_pruning_path(X_train_encoded, y_train) # type: ignore
ccp_alphas, impurities = path.ccp_alphas, path.impurities

# Create a DataFrame and sort by ccp_alphas in descending order.
pruning_path_df = pd.DataFrame(path)
pruning_path_df_sorted = pruning_path_df.sort_values(by='ccp_alphas', ascending=False)

# Show sorted pruning path values.
pruning_path_df_sorted
Out[ ]:
ccp_alphas impurities
32 0.047561 0.173568
31 0.034690 0.126007
30 0.025722 0.091318
29 0.008156 0.039874
28 0.002970 0.031718
27 0.002335 0.028748
26 0.001908 0.026413
25 0.001782 0.024505
24 0.001625 0.022723
23 0.001330 0.021097
22 0.000994 0.019768
21 0.000989 0.018773
20 0.000952 0.017784
19 0.000667 0.016832
18 0.000654 0.014162
17 0.000583 0.012854
16 0.000508 0.012271
15 0.000495 0.011763
14 0.000495 0.010772
13 0.000488 0.010278
12 0.000470 0.009790
11 0.000467 0.006029
10 0.000457 0.005563
9 0.000429 0.005105
8 0.000400 0.004677
7 0.000343 0.003477
6 0.000286 0.003134
5 0.000286 0.002563
4 0.000272 0.001991
3 0.000257 0.001447
2 0.000190 0.000933
1 0.000184 0.000552
0 0.000000 0.000000
In [ ]:
# Create a figure and an axis object with a specified size.
fig, ax = plt.subplots(figsize=(10, 5))

# Plot the relationship between effective alpha and total impurity of leaves
# Use markers "o" and draw style "steps-post" for the plot.
ax.plot(ccp_alphas[:-1], impurities[:-1], marker="o", drawstyle="steps-post")

# Set the label for the x-axis.
ax.set_xlabel("effective alpha")

# Set the label for the y-axis.
ax.set_ylabel("total impurity of leaves")

# Set the title of the plot.
ax.set_title("Total Impurity vs effective alpha for training set")

# Display the plot.
plt.show()
In [ ]:
# Initialize an empty list to store the decision tree classifiers.
clfs = []

# Iterate over the list of ccp_alpha values.
for ccp_alpha in ccp_alphas:
    # Initialize a DecisionTreeClassifier with the current ccp_alpha value
    # and a fixed random state.
    clf = DecisionTreeClassifier(random_state=1, ccp_alpha=ccp_alpha)

    # Fit the decision tree classifier on the training data.
    clf.fit(X_train_encoded, y_train) # type: ignore

    # Append the fitted classifier to the list.
    clfs.append(clf)

# Print the number of nodes in the last tree and the corresponding ccp_alpha value.
print(
    "Number of nodes in the last tree is: {} with ccp_alpha: {}".format(
        clfs[-1].tree_.node_count, ccp_alphas[-1]
    )
)
Number of nodes in the last tree is: 1 with ccp_alpha: 0.04756053380018527
In [ ]:
# Remove the last element from the list of classifiers and ccp_alphas.
clfs = clfs[:-1]
ccp_alphas = ccp_alphas[:-1]

# Calculate the number of nodes for each classifier.
node_counts = [clf.tree_.node_count for clf in clfs]

# Calculate the depth of each classifier.
depth = [clf.tree_.max_depth for clf in clfs]

# Create a figure with two subplots, arranged vertically, with a specified size.
fig, ax = plt.subplots(2, 1, figsize=(10, 7))

# Plot the number of nodes vs alpha on the first subplot.
ax[0].plot(ccp_alphas, node_counts, marker="o", drawstyle="steps-post")
ax[0].set_xlabel("alpha")
ax[0].set_ylabel("number of nodes")
ax[0].set_title("Number of nodes vs alpha")

# Plot the depth of the tree vs alpha on the second subplot.
ax[1].plot(ccp_alphas, depth, marker="o", drawstyle="steps-post")
ax[1].set_xlabel("alpha")
ax[1].set_ylabel("depth of tree")
ax[1].set_title("Depth vs alpha")

# Adjust the layout to prevent overlap.
fig.tight_layout()
In [ ]:
# Initialize an empty list to store recall scores for the training set.
recall_train = []

# Iterate over the list of classifiers.
for clf in clfs:
    # Predict the labels for the training set using the current classifier.
    pred_train = clf.predict(X_train_encoded)

    # Calculate the recall score for the training set.
    values_train = recall_score(y_train, pred_train)

    # Append the recall score to the recall_train list.
    recall_train.append(values_train)

# Initialize an empty list to store recall scores for the test set.
recall_test = []

# Iterate over the list of classifiers.
for clf in clfs:
    # Predict the labels for the test set using the current classifier.
    pred_test = clf.predict(X_test_encoded) # type: ignore

    # Calculate the recall score for the test set.
    values_test = recall_score(y_test, pred_test)

    # Append the recall score to the recall_test list.
    recall_test.append(values_test)
In [ ]:
# Create a figure and an axis object with a specified size.
fig, ax = plt.subplots(figsize=(15, 5))

# Set the label for the x-axis.
ax.set_xlabel("alpha")

# Set the label for the y-axis.
ax.set_ylabel("Recall")

# Set the title of the plot.
ax.set_title("Recall vs alpha for training and testing sets")

# Plot the recall scores for the training set vs alpha.
# Use markers "o" and draw style "steps-post" for the plot.
ax.plot(ccp_alphas, recall_train, marker="o", label="train", drawstyle="steps-post")

# Plot the recall scores for the test set vs alpha.
# Use markers "o" and draw style "steps-post" for the plot.
ax.plot(ccp_alphas, recall_test, marker="o", label="test", drawstyle="steps-post")

# Add a legend to the plot.
ax.legend()

# Display the plot.
plt.show()
In [ ]:
# Find the index of the classifier with the highest recall score on the test set.
index_best_model = np.argmax(recall_test)

# Select the classifier corresponding to the best recall score.
best_model = clfs[index_best_model]

# Print the details of the best model.
print(best_model)
DecisionTreeClassifier(ccp_alpha=0.0006674876847290641, random_state=1)
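Note that picking `ccp_alpha` by the highest test-set recall lets the test set influence model selection. A leakage-free alternative scores each candidate alpha with cross-validation on the training data only; a self-contained sketch on synthetic data (standing in for the notebook's actual training set):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Toy imbalanced training data standing in for X_train_encoded / y_train.
X, y = make_classification(n_samples=400, weights=[0.9, 0.1], random_state=1)

path = DecisionTreeClassifier(random_state=1).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas[:-1]  # drop the alpha that collapses the tree to one node

# Mean cross-validated recall for each candidate alpha.
cv_recall = [
    cross_val_score(DecisionTreeClassifier(random_state=1, ccp_alpha=a),
                    X, y, cv=5, scoring="recall").mean()
    for a in alphas
]
best_alpha = alphas[int(np.argmax(cv_recall))]
print(best_alpha)
```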

Post-Pruning

In [ ]:
# Initialize the DecisionTreeClassifier with the largest ccp_alpha remaining in the
# pruning path (the value that collapses the tree to a single node was dropped above),
# set class weights to handle class imbalance, and set a random state for reproducibility.
estimator_2 = DecisionTreeClassifier(
    ccp_alpha=ccp_alphas[-1],  # Largest remaining ccp_alpha value.
    class_weight={0: 0.15, 1: 0.85},  # Set class weights.
    random_state=1  # Set random state for reproducibility.
)

# Fit the classifier on the training data.
estimator_2.fit(X_train_encoded, y_train) # type: ignore
Out[ ]:
DecisionTreeClassifier(ccp_alpha=0.03468953979707104,
                       class_weight={0: 0.15, 1: 0.85}, random_state=1)
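The `{0: 0.15, 1: 0.85}` weights above are hand-picked. sklearn can instead derive weights inversely proportional to class frequency; a sketch using the train-set class counts from the classification reports above (3164 negatives, 336 positives):

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Train-set labels reconstructed from the class counts reported above.
y_tr = np.array([0] * 3164 + [1] * 336)

weights = compute_class_weight(class_weight="balanced",
                               classes=np.array([0, 1]), y=y_tr)
# n_samples / (n_classes * count): class 0 comes out near 0.55, class 1 near 5.2
print(dict(zip([0, 1], np.round(weights, 3))))
```

Equivalently, `class_weight="balanced"` can be passed directly to `DecisionTreeClassifier` to apply the same scheme without computing the weights by hand.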
In [ ]:
# Visualize the decision tree.
plt.figure(figsize=(10, 10))
out = tree.plot_tree(
    estimator_2,
    feature_names=list(X_train_encoded.columns), # type: ignore
    filled=True,
    fontsize=9,
    node_ids=False,
    class_names=None,
)
# Add arrows to the decision tree splits if they are missing.
for o in out:
    arrow = o.arrow_patch
    if arrow is not None:
        arrow.set_edgecolor('black')
        arrow.set_linewidth(1)
plt.show()
In [ ]:
# Generate a text report showing the rules of the decision tree.
tree_rules = tree.export_text(estimator_2, feature_names=list(X_train_encoded.columns), show_weights=True) # type: ignore
print(tree_rules)
|--- income_bin_High <= 0.50
|   |--- weights: [388.80, 15.30] class: 0
|--- income_bin_High >  0.50
|   |--- Family <= 2.50
|   |   |--- Education_3 <= 0.50
|   |   |   |--- Education_2 <= 0.50
|   |   |   |   |--- weights: [72.30, 2.55] class: 0
|   |   |   |--- Education_2 >  0.50
|   |   |   |   |--- weights: [2.55, 52.70] class: 1
|   |   |--- Education_3 >  0.50
|   |   |   |--- weights: [2.70, 65.45] class: 1
|   |--- Family >  2.50
|   |   |--- weights: [8.25, 149.60] class: 1

In [ ]:
# Baseline: re-check the unpruned model's performance on the training data.
# (This variable is overwritten below with the post-pruned model's metrics.)
decision_tree_tune_post_train = model_performance_classification_sklearn(model, X_train_encoded, y_train) # type: ignore
decision_tree_tune_post_train
Out[ ]:
{'Accuracy': 1.0, 'Precision': 1.0, 'Recall': 1.0, 'F1 Score': 1.0}
In [ ]:
# Assign the feature importances to a variable.
importances = estimator_2.feature_importances_

# Sort indices in descending order.
indices = np.argsort(importances)[::-1]

# Create a DataFrame for feature importances.
feature_importances_df = pd.DataFrame({
    'Feature': [list(X_train_encoded.columns)[i] for i in indices], # type: ignore
    'Importance': importances[indices]
})

# Print the DataFrame.
print(feature_importances_df)
                 Feature  Importance
0        income_bin_High    0.664004
1            Education_2    0.181362
2            Education_3    0.086560
3                 Family    0.068074
4         age_bin_Senior    0.000000
5   Securities_Account_1    0.000000
6                 Income    0.000000
7                  CCAvg    0.000000
8               Mortgage    0.000000
9           CD_Account_1    0.000000
10   age_bin_Middle Aged    0.000000
11              Online_1    0.000000
12          CreditCard_1    0.000000
13        income_bin_Mid    0.000000
14   cc_spending_bin_Mid    0.000000
15  cc_spending_bin_High    0.000000
16                   Age    0.000000
In [ ]:
# Check performance on the training data.
decision_tree_tune_post_train = model_performance_classification_sklearn(estimator_2, X_train_encoded, y_train) # type: ignore
decision_tree_tune_post_train
Out[ ]:
{'Accuracy': 0.9682857142857143,
 'Precision': 0.7777777777777778,
 'Recall': 0.9375,
 'F1 Score': 0.8502024291497976}

Model Performance Comparison and Final Model Selection

In [ ]:
# Training data performance comparison.

# Convert dictionaries to DataFrames
decision_tree_perf_train_df = pd.DataFrame.from_dict(decision_tree_perf_train, orient='index') # type: ignore
decision_tree_tune_perf_train_df = pd.DataFrame.from_dict(decision_tree_tune_perf_train, orient='index')

# Concatenate the performance DataFrames along the columns
models_train_comp_df = pd.concat(
    [decision_tree_perf_train_df, decision_tree_tune_perf_train_df], axis=1
)

# Set the column names for the concatenated DataFrame
models_train_comp_df.columns = ["Decision Tree sklearn", "Decision Tree (Pre-Pruning)"]

# Print the training performance comparison
print("Training performance comparison:")
models_train_comp_df
Training performance comparison:
Out[ ]:
Decision Tree sklearn Decision Tree (Pre-Pruning)
Accuracy 1.0 0.978286
Precision 1.0 0.871429
Recall 1.0 0.907738
F1 Score 1.0 0.889213
In [ ]:
# Test data performance comparison.
# NOTE: decision_tree_tune_post_train holds the post-pruned model's *training*
# metrics; no test-set evaluation was recorded for that model, so its training
# figures stand in here.

# Convert dictionaries to DataFrames
decision_tree_perf_test_df = pd.DataFrame.from_dict(decision_tree_perf_test, orient='index')
decision_tree_tune_post_test_df = pd.DataFrame.from_dict(decision_tree_tune_post_train, orient='index')

# Concatenate the performance DataFrames along the columns
models_test_comp_df = pd.concat(
    [decision_tree_perf_test_df, decision_tree_tune_post_test_df], axis=1
)

# Set the column names for the concatenated DataFrame
models_test_comp_df.columns = ["Decision Tree sklearn", "Decision Tree (Post-Pruning)"]

# Print the test performance comparison
print("Test performance comparison:")
models_test_comp_df
Test performance comparison:
Out[ ]:
Decision Tree sklearn Decision Tree (Post-Pruning)
Accuracy 0.982667 0.968286
Precision 0.960938 0.777778
Recall 0.854167 0.937500
F1 Score 0.904412 0.850202

Actionable Insights and Business Recommendations

  • What recommendations would you suggest to the bank?

Model Insights

  • On the training data, the pre-pruning model has higher accuracy (0.9783) than the post-pruning model (0.9683).
  • It also has higher precision (0.8714) but lower recall (0.9077) than the post-pruning model (recall 0.9375).
  • Recommendation: the pre-pruning model is preferred when accuracy, precision, and F1-score matter most.
  • Consideration: if recall is the priority (i.e., minimizing missed potential loan adopters), the post-pruning model may be more suitable despite its lower precision.

Recommendations

Data Collection Enhancements:

✔ Gather additional insights into loan rejections (e.g., customers defaulting with other banks).
✔ Collect customer satisfaction data across Online Banking, CD & Securities accounts, and Credit Card services to assess influence on loan adoption.
✔ Analyze payment history to identify missed payments, helping tailor loan offers based on financial stability.
✔ Track customer loyalty duration to measure long-term banking relationships and retention strategies.

Feature Importance & Customer Segmentation:

✔ Assess whether additional features—like customer demographics—could enhance marketing campaigns and targeted services.
✔ Review ZIP code data for location-based loan strategies (e.g., higher loan amounts for high-income regions).
✔ Identify lower-income customers and explore offering smaller loan amounts with reduced rates to improve accessibility.

Loyalty & Retention Strategies:

✔ Introduce a loyalty program with competitive interest rates and reduced service fees to incentivize long-term customers.

Business Process Optimization:

✔ Utilize the model to automate parts of the personal loan approval process, reducing manual workload and improving efficiency.
✔ Implement stricter credit checks or customized loan amounts to mitigate risk—leveraging data from income and credit card usage.
✔ Establish continuous monitoring and periodic model updates to keep predictions aligned with evolving financial trends.